The main focus of our analysis is Lifetime Value (LTV): the average profit a single customer generates over the entire relationship with us. Knowing LTV tells us how much we can spend on acquiring a new customer without making a loss.
We will also assess the quality of our advertising channels by determining which ones bring the most valuable customers, which will let us optimize marketing spend and increase Return on Investment (ROI).
Customer retention will be another important part of the analysis: how well do we retain customers, and which factors drive their loyalty?
Using RFM (Recency, Frequency, Monetary) analysis, we will segment our customers to better understand their behavior and needs, informing more effective marketing and sales strategies.
Finally, we will attempt to forecast next year's revenue from historical data, to support budget and strategy planning.
We hope this analysis will help us improve our operations and increase profits. Let's get started.
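Before diving into the data, it helps to fix what "LTV" means numerically. A minimal cohort-free sketch multiplies average profit per order, purchase frequency, and expected customer lifespan; all numbers below are illustrative placeholders, not values from our dataset.

```python
# Minimal illustrative LTV formula (all inputs are hypothetical placeholders).
def simple_ltv(avg_order_value, gross_margin, orders_per_year, lifespan_years):
    """LTV = average profit per order * orders per year * expected lifespan."""
    return avg_order_value * gross_margin * orders_per_year * lifespan_years

# Example: $25 AOV, 40% margin, 4 orders/year, 3-year lifespan -> $120 LTV
print(simple_ltv(25.0, 0.40, 4, 3))
```

Later in the analysis the lifetimes package (BGF/Gamma-Gamma) replaces these fixed inputs with fitted probabilistic estimates.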
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
%matplotlib inline
import matplotlib as mpl
from matplotlib import cm
import statsmodels.stats.api as sms
from scipy.stats import mode
import matplotlib.colors as mcolors
import plotly.graph_objs as go
from plotly.offline import iplot
from plotly.subplots import make_subplots
from operator import attrgetter
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
davies_bouldin_score, silhouette_samples)
from lifetimes import BetaGeoFitter, GammaGammaFitter
from lifetimes.plotting import plot_period_transactions
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.cluster import KMeans, DBSCAN
import os
import warnings
# import pylab
# import squarify
# import time
# from datetime import date, datetime, timedelta as dt
warnings.filterwarnings('ignore')
%load_ext nb_black
sns.set(rc={'axes.facecolor':'#FFFFFF'}, palette='Set1') # graph facecolor #e6e6fa
font_color = '#525252'
csfont = {'fontname':'Georgia'} # title font
hfont = {'fontname':'Calibri'} # main font
annot_color = '#525252' # annotation color
c1 = '#264450' #color bar 1
c2 = '#465945' #color bar 2
c3 = '#e05b4b' #color bar 3 red
os.listdir('.')
['CRM_LTV_RFM.ipynb', 'customers.csv', 'data.csv', 'newplot.png', 'newplot2.png', 'newplot_map.png', 'orders.csv', 'prep_LTV.ipynb']
orders = pd.read_csv('orders.csv', encoding = 'unicode_escape',
dtype = {'order_id': str},
parse_dates = ['invoice_date'],
infer_datetime_format = True,
usecols=range(1, 6))
customers = pd.read_csv('customers.csv', encoding = 'unicode_escape',
dtype = {'customer_id': str, 'order_id': str},
usecols=range(1, 6))
Below is the description of the data, there are a total of 5 columns and 541909 rows:
| Data columns | Purpose | Types |
|---|---|---|
| order_id | Unique purchase number | object |
| vendor_code | Product code | object |
| quantity | Quantity of goods | int64 |
| invoice_date | Purchase invoice date | datetime64[ns] |
| unit_price | Cost of 1 unit of goods, in $. | float64 |
def check_data(df, head=6):
print(" INFO ".center(80,'-'))
print('Shape of the training dataset:')
print('There are {} rows and {} columns in dataset.\n'.format(df.shape[0], df.shape[1]))
print(" TOP ".center(80,'-'))
print(df.head(head))
print(" TAIL ".center(80,'-'))
print(df.tail(head))
print(" MISSING VALUES ".center(80,'-'))
    print(f'\nAvailability of NaN values >>> {df.isnull().sum().any()}\n')
print(df.isnull().sum())
print(" DUPLICATED ROWS ".center(80,'-'))
print(f'\nNumber of completely identical rows >>> {df.duplicated().sum()}\n')
print(" UNIQUE VALUES ".center(80,'-'))
unique_counts = pd.DataFrame.from_records([(col, df[col].nunique()) for col in df.columns],
columns=['Column_name', 'Nums_unique'])\
.sort_values(by=['Nums_unique'], ascending=False)
print(unique_counts)
print(" DESCRIBE ".center(80,'-'))
print(df.describe())
check_data(orders)
------------------------------------- INFO -------------------------------------
Shape of the training dataset:
There are 541909 rows and 5 columns in dataset.
------------------------------------- TOP --------------------------------------
order_id vendor_code quantity invoice_date unit_price
0 536365 85123A 6 2022-12-01 08:26:00 2.55
1 536365 71053 6 2022-12-01 08:26:00 3.39
2 536365 84406B 8 2022-12-01 08:26:00 2.75
3 536365 84029G 6 2022-12-01 08:26:00 3.39
4 536365 84029E 6 2022-12-01 08:26:00 3.39
5 536365 22752 2 2022-12-01 08:26:00 7.65
------------------------------------- TAIL -------------------------------------
order_id vendor_code quantity invoice_date unit_price
541903 581587 23256 4 2022-12-09 12:50:00 4.15
541904 581587 22613 12 2022-12-09 12:50:00 0.85
541905 581587 22899 6 2022-12-09 12:50:00 2.10
541906 581587 23254 4 2022-12-09 12:50:00 4.15
541907 581587 23255 4 2022-12-09 12:50:00 4.15
541908 581587 22138 3 2022-12-09 12:50:00 4.95
-------------------------------- MISSING VALUES --------------------------------
Availability of NaN values >>> False
order_id 0
vendor_code 0
quantity 0
invoice_date 0
unit_price 0
dtype: int64
------------------------------- DUPLICATED ROWS --------------------------------
Number of completely identical rows >>> 5270
-------------------------------- UNIQUE VALUES ---------------------------------
Column_name Nums_unique
0 order_id 25900
3 invoice_date 23104
1 vendor_code 4070
4 unit_price 1630
2 quantity 722
----------------------------------- DESCRIBE -----------------------------------
quantity unit_price
count 541909.000000 541909.000000
mean 9.552250 4.611114
std 218.081158 96.759853
min -80995.000000 -11062.060000
25% 1.000000 1.250000
50% 3.000000 2.080000
75% 10.000000 4.130000
max 80995.000000 38970.000000
Below is the description of the data, there are a total of 5 columns and 25900 rows:
| Data columns | Purpose | Types |
|---|---|---|
| order_id | Unique purchase number | object |
| customer_id | Unique client number | object |
| country | Country from which the order was made | object |
| channel | Channel where the customer came from: organic - organic search results or direct site visits; fb_ads - Facebook Ads; g_ads_s - ads on Google search results; g_ads_b - Google banner advertising and remarketing; other_banner - other banner advertising on partner sites | object |
| cost | Cost of advertising, transition, in $. | float64 |
check_data(customers)
------------------------------------- INFO -------------------------------------
Shape of the training dataset:
There are 25900 rows and 5 columns in dataset.
------------------------------------- TOP --------------------------------------
order_id customer_id country channel cost
0 536365 17850 United Kingdom organic 0.00
1 536366 17850 United Kingdom fb_ads 1.59
2 536367 13047 USA organic 0.00
3 536368 13047 USA g_ads_s 2.53
4 536369 13047 USA other_banner 0.40
5 536370 12583 France g_ads_s 1.43
------------------------------------- TAIL -------------------------------------
order_id customer_id country channel cost
25894 C581470 17924 Netherlands fb_ads 0.22
25895 C581484 16446 Germany organic 0.00
25896 C581490 14397 USA g_ads_b 0.08
25897 C581499 15498 Belgium g_ads_s 3.34
25898 C581568 15311 Germany g_ads_b 0.06
25899 C581569 17315 United Kingdom g_ads_b 0.06
-------------------------------- MISSING VALUES --------------------------------
Availability of NaN values >>> False
order_id 0
customer_id 0
country 0
channel 0
cost 0
dtype: int64
------------------------------- DUPLICATED ROWS --------------------------------
Number of completely identical rows >>> 0
-------------------------------- UNIQUE VALUES ---------------------------------
Column_name Nums_unique
0 order_id 25900
1 customer_id 6041
4 cost 342
2 country 38
3 channel 5
----------------------------------- DESCRIBE -----------------------------------
cost
count 25900.000000
mean 0.533727
std 0.767301
min 0.000000
25% 0.000000
50% 0.140000
75% 0.792500
max 3.470000
df = pd.merge(orders, customers, on='order_id')
check_data(df)
------------------------------------- INFO -------------------------------------
Shape of the training dataset:
There are 541909 rows and 9 columns in dataset.
------------------------------------- TOP --------------------------------------
order_id vendor_code quantity invoice_date unit_price customer_id \
0 536365 85123A 6 2022-12-01 08:26:00 2.55 17850
1 536365 71053 6 2022-12-01 08:26:00 3.39 17850
2 536365 84406B 8 2022-12-01 08:26:00 2.75 17850
3 536365 84029G 6 2022-12-01 08:26:00 3.39 17850
4 536365 84029E 6 2022-12-01 08:26:00 3.39 17850
5 536365 22752 2 2022-12-01 08:26:00 7.65 17850
country channel cost
0 United Kingdom organic 0.0
1 United Kingdom organic 0.0
2 United Kingdom organic 0.0
3 United Kingdom organic 0.0
4 United Kingdom organic 0.0
5 United Kingdom organic 0.0
------------------------------------- TAIL -------------------------------------
order_id vendor_code quantity invoice_date unit_price \
541903 581587 23256 4 2022-12-09 12:50:00 4.15
541904 581587 22613 12 2022-12-09 12:50:00 0.85
541905 581587 22899 6 2022-12-09 12:50:00 2.10
541906 581587 23254 4 2022-12-09 12:50:00 4.15
541907 581587 23255 4 2022-12-09 12:50:00 4.15
541908 581587 22138 3 2022-12-09 12:50:00 4.95
customer_id country channel cost
541903 12680 France other_banner 0.74
541904 12680 France other_banner 0.74
541905 12680 France other_banner 0.74
541906 12680 France other_banner 0.74
541907 12680 France other_banner 0.74
541908 12680 France other_banner 0.74
-------------------------------- MISSING VALUES --------------------------------
Availability of NaN values >>> False
order_id 0
vendor_code 0
quantity 0
invoice_date 0
unit_price 0
customer_id 0
country 0
channel 0
cost 0
dtype: int64
------------------------------- DUPLICATED ROWS --------------------------------
Number of completely identical rows >>> 5270
-------------------------------- UNIQUE VALUES ---------------------------------
Column_name Nums_unique
0 order_id 25900
3 invoice_date 23104
5 customer_id 6041
1 vendor_code 4070
4 unit_price 1630
2 quantity 722
8 cost 342
6 country 38
7 channel 5
----------------------------------- DESCRIBE -----------------------------------
quantity unit_price cost
count 541909.000000 541909.000000 541909.000000
mean 9.552250 4.611114 0.537058
std 218.081158 96.759853 0.778916
min -80995.000000 -11062.060000 0.000000
25% 1.000000 1.250000 0.000000
50% 3.000000 2.080000 0.130000
75% 10.000000 4.130000 0.790000
max 80995.000000 38970.000000 3.470000
Let's inspect the 5270 fully duplicated rows.
# Getting duplicates
df_dupl = df[df.duplicated(keep=False)]
df_dupl_sorted = df_dupl.sort_values(by=['order_id', 'vendor_code'])
df_dupl_sorted.head(10)
| order_id | vendor_code | quantity | invoice_date | unit_price | customer_id | country | channel | cost | |
|---|---|---|---|---|---|---|---|---|---|
| 494 | 536409 | 21866 | 1 | 2022-12-01 11:45:00 | 1.25 | 17908 | United Kingdom | fb_ads | 1.12 |
| 517 | 536409 | 21866 | 1 | 2022-12-01 11:45:00 | 1.25 | 17908 | United Kingdom | fb_ads | 1.12 |
| 485 | 536409 | 22111 | 1 | 2022-12-01 11:45:00 | 4.95 | 17908 | United Kingdom | fb_ads | 1.12 |
| 539 | 536409 | 22111 | 1 | 2022-12-01 11:45:00 | 4.95 | 17908 | United Kingdom | fb_ads | 1.12 |
| 489 | 536409 | 22866 | 1 | 2022-12-01 11:45:00 | 2.10 | 17908 | United Kingdom | fb_ads | 1.12 |
| 527 | 536409 | 22866 | 1 | 2022-12-01 11:45:00 | 2.10 | 17908 | United Kingdom | fb_ads | 1.12 |
| 521 | 536409 | 22900 | 1 | 2022-12-01 11:45:00 | 2.95 | 17908 | United Kingdom | fb_ads | 1.12 |
| 537 | 536409 | 22900 | 1 | 2022-12-01 11:45:00 | 2.95 | 17908 | United Kingdom | fb_ads | 1.12 |
| 565 | 536412 | 21448 | 2 | 2022-12-01 11:49:00 | 1.65 | 17920 | Australia | other_banner | 0.54 |
| 578 | 536412 | 21448 | 1 | 2022-12-01 11:49:00 | 1.65 | 17920 | Australia | other_banner | 0.54 |
Removing duplicates
df = df.drop_duplicates()
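By default `drop_duplicates` keeps the first occurrence of each fully identical row; the `subset` and `keep` parameters control what counts as a duplicate and which copy survives. A toy illustration (synthetic rows, not our orders):

```python
import pandas as pd

# Synthetic data: the first two rows are fully identical.
toy = pd.DataFrame({
    "order_id": ["536409", "536409", "536412"],
    "vendor_code": ["21866", "21866", "21448"],
    "quantity": [1, 1, 2],
})
deduped = toy.drop_duplicates()              # identical rows collapse to one
print(len(toy), "->", len(deduped))          # 3 -> 2
# keep=False would instead drop BOTH copies of every duplicated row:
print(len(toy.drop_duplicates(keep=False)))  # 1
```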
Saving a basic copy of the dataset
df_basic = df.copy(deep=True)
print(f'Minimum date >>> {df.invoice_date.min()}')
print(f'Maximum date >>> {df.invoice_date.max()}')
period = df.invoice_date.max() - df.invoice_date.min()
print(f'Dataset time period >>> {period}')
Minimum date >>> 2022-01-04 10:00:00 Maximum date >>> 2022-12-23 17:41:00 Dataset time period >>> 353 days 07:41:00
customers_by_country = df.groupby('country')['customer_id'].nunique()
grouped_data = df[['customer_id', 'order_id', 'country']]
grouped_data = grouped_data.groupby(['order_id', 'country']).count()
grouped_data = grouped_data.reset_index(drop=False)
countries = grouped_data['country'].value_counts()
text_labels = [f'unique customers - {customers_by_country[country]}' for country in countries.index]
choropleth_dict = dict(
type='choropleth',
locations=countries.index,
locationmode='country names',
z=countries,
text=text_labels,
colorbar={'title': 'Orders'},
colorscale=[[0, 'green'], [0.1, 'yellow'], [1, 'red']],
reversescale=False
)
layout_dict = dict(
title={
'text': "Number of orders by countries",
'y': 0.9,
'x': 0.5,
'xanchor': 'center',
'yanchor': 'top'
},
geo=dict(
resolution=50,
showocean=True,
oceancolor="LightBlue",
showland=True,
landcolor="whitesmoke",
showframe=True
),
template='plotly_white',
height=600,
width=1000
)
choromap = go.Figure(data=[choropleth_dict], layout=layout_dict)
iplot(choromap, validate=False)
# The customers_and_orders_by_country table provides an aggregated overview
# of the data, where the number of unique customers (customer_id)
# and the number of unique orders (order_id) are counted for each country.
customers_and_orders_by_country = df.groupby('country').agg({
'customer_id': pd.Series.nunique,
'order_id': pd.Series.nunique
}).sort_values(by=['customer_id', 'order_id'], ascending=False)
customers_and_orders_by_country
| customer_id | order_id | |
|---|---|---|
| country | ||
| USA | 1798 | 7162 |
| United Kingdom | 1515 | 6577 |
| Germany | 584 | 2662 |
| Canada | 290 | 1461 |
| France | 133 | 626 |
| EIRE | 81 | 529 |
| Spain | 78 | 316 |
| Belgium | 70 | 328 |
| Austria | 70 | 221 |
| Finland | 65 | 384 |
| Hong Kong | 63 | 199 |
| Switzerland | 62 | 188 |
| Greece | 60 | 299 |
| Netherlands | 58 | 306 |
| Israel | 58 | 226 |
| Japan | 58 | 178 |
| Poland | 57 | 224 |
| Norway | 57 | 195 |
| RSA | 57 | 187 |
| Denmark | 56 | 268 |
| Singapore | 54 | 222 |
| Lebanon | 54 | 157 |
| Sweden | 52 | 218 |
| Italy | 52 | 167 |
| Lithuania | 51 | 381 |
| Portugal | 51 | 209 |
| Cyprus | 51 | 186 |
| Czech Republic | 49 | 287 |
| Channel Islands | 48 | 206 |
| United Arab Emirates | 46 | 178 |
| Bahrain | 46 | 150 |
| Australia | 45 | 216 |
| Brazil | 44 | 162 |
| Iceland | 44 | 161 |
| Unspecified | 43 | 110 |
| Malta | 41 | 120 |
| Saudi Arabia | 38 | 125 |
| European Community | 32 | 109 |
# We calculate the average number of orders per client for each country
average_orders_per_customer = df.groupby('country')['order_id'].nunique() / df.groupby('country')['customer_id'].nunique()
customers_and_orders_by_country['average_orders_per_customer'] = average_orders_per_customer
customers_and_orders_by_country.rename(columns={'average_orders_per_customer':'AVG_orders_per_customer',
'customer_id':'customers', 'order_id':'orders'}) \
.sort_values(by='AVG_orders_per_customer', ascending=False)
| customers | orders | AVG_orders_per_customer | |
|---|---|---|---|
| country | |||
| Lithuania | 51 | 381 | 7.470588 |
| EIRE | 81 | 529 | 6.530864 |
| Finland | 65 | 384 | 5.907692 |
| Czech Republic | 49 | 287 | 5.857143 |
| Netherlands | 58 | 306 | 5.275862 |
| Canada | 290 | 1461 | 5.037931 |
| Greece | 60 | 299 | 4.983333 |
| Australia | 45 | 216 | 4.800000 |
| Denmark | 56 | 268 | 4.785714 |
| France | 133 | 626 | 4.706767 |
| Belgium | 70 | 328 | 4.685714 |
| Germany | 584 | 2662 | 4.558219 |
| United Kingdom | 1515 | 6577 | 4.341254 |
| Channel Islands | 48 | 206 | 4.291667 |
| Sweden | 52 | 218 | 4.192308 |
| Singapore | 54 | 222 | 4.111111 |
| Portugal | 51 | 209 | 4.098039 |
| Spain | 78 | 316 | 4.051282 |
| USA | 1798 | 7162 | 3.983315 |
| Poland | 57 | 224 | 3.929825 |
| Israel | 58 | 226 | 3.896552 |
| United Arab Emirates | 46 | 178 | 3.869565 |
| Brazil | 44 | 162 | 3.681818 |
| Iceland | 44 | 161 | 3.659091 |
| Cyprus | 51 | 186 | 3.647059 |
| Norway | 57 | 195 | 3.421053 |
| European Community | 32 | 109 | 3.406250 |
| Saudi Arabia | 38 | 125 | 3.289474 |
| RSA | 57 | 187 | 3.280702 |
| Bahrain | 46 | 150 | 3.260870 |
| Italy | 52 | 167 | 3.211538 |
| Hong Kong | 63 | 199 | 3.158730 |
| Austria | 70 | 221 | 3.157143 |
| Japan | 58 | 178 | 3.068966 |
| Switzerland | 62 | 188 | 3.032258 |
| Malta | 41 | 120 | 2.926829 |
| Lebanon | 54 | 157 | 2.907407 |
| Unspecified | 43 | 110 | 2.558140 |
customers_and_orders_by_country = customers_and_orders_by_country.sort_values(by='customer_id', ascending=True)
fig = plt.figure()
fig.set_figwidth(7)
fig.set_figheight(9)
ax = fig.add_axes([0.1,0.1,0.9,0.9])
bar0 = ax.barh(customers_and_orders_by_country.index,
customers_and_orders_by_country['customer_id'],
label='Customer ID', color = c1, alpha=1, height = 0.9)
bar = ax.barh(customers_and_orders_by_country.index,
customers_and_orders_by_country['order_id'],
label='Order ID', color = c2, alpha=0.8, height = 0.5, linewidth=0.5)
ax.set_xlim([0, 700])
ax2 = ax.twiny()
average_orders_per_customer_rounded = customers_and_orders_by_country['average_orders_per_customer'].round(1)
ax2.scatter(average_orders_per_customer_rounded, customers_and_orders_by_country.index, color='red', label='Average Orders per Customer')
for i, txt in enumerate(average_orders_per_customer_rounded):
    # .iloc avoids label-based lookup: the Series is indexed by country name,
    # so integer keys would not resolve positionally.
    ax2.annotate(txt, (average_orders_per_customer_rounded.iloc[i], customers_and_orders_by_country.index[i]))
fig.legend(loc="center right")
ax.set_frame_on(False)
ax.tick_params(left=False,
bottom=False,
labelbottom=False)
ax.yaxis.set_tick_params(rotation=0, labelsize=14)
plt.title('Customers and Orders by Country', fontsize=20, color=font_color, **csfont)
plt.show()
- The countries with the highest number of unique customers and orders are the USA and the United Kingdom. This may indicate that these markets are primary for your business
- In some countries, such as Lithuania and EIRE, the number of unique customers is relatively small, but the average number of orders per customer is higher. This may indicate that customers from these countries make more purchases on average
- It's important to note that the number of orders is not always directly proportional to the number of customers. This can be related to various factors such as repeat purchases, order size, etc.
- The average number of orders per customer can provide additional insights into customer behavior. For example, a high average might indicate a high level of customer loyalty, while a low average might suggest that many customers make only one purchase
- For a deeper analysis, you can consider other metrics that can help better understand customer behavior, such as the total revenue per customer, the average order value, or the frequency of repeat purchases
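The last suggestion is cheap to compute: with `quantity` and `unit_price` in hand, revenue per customer and average order value are one `groupby` away. A sketch on toy rows (hypothetical figures, not our data):

```python
import pandas as pd

toy = pd.DataFrame({
    "order_id":    ["A1", "A1", "A2", "B1"],
    "customer_id": ["c1", "c1", "c1", "c2"],
    "quantity":    [2, 1, 3, 5],
    "unit_price":  [2.50, 4.00, 1.00, 2.00],
})
toy["revenue"] = toy["quantity"] * toy["unit_price"]

# Total revenue per customer.
revenue_per_customer = toy.groupby("customer_id")["revenue"].sum()
# Average order value: sum per order first, then average per customer.
order_totals = toy.groupby(["customer_id", "order_id"])["revenue"].sum()
avg_order_value = order_totals.groupby("customer_id").mean()

print(revenue_per_customer.to_dict())  # {'c1': 12.0, 'c2': 10.0}
print(avg_order_value.to_dict())       # {'c1': 6.0, 'c2': 10.0}
```

The same two lines applied to `df` feed directly into the RFM and LTV sections that follow.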
Let's look at the data where the 'quantity' is less than or equal to 0.
df_=df[df['quantity']<=0]
print(f'Number of data rows: {df_.shape[0]}')
df_.head(5)
Number of data rows: 10587
| order_id | vendor_code | quantity | invoice_date | unit_price | customer_id | country | channel | cost | |
|---|---|---|---|---|---|---|---|---|---|
| 141 | C536379 | D | -1 | 2022-12-01 09:41:00 | 27.50 | 14527 | USA | fb_ads | 0.18 |
| 154 | C536383 | 35004C | -1 | 2022-12-01 09:49:00 | 4.65 | 15311 | Germany | other_banner | 0.36 |
| 235 | C536391 | 22556 | -12 | 2022-12-01 10:24:00 | 1.65 | 17548 | Israel | other_banner | 0.61 |
| 236 | C536391 | 21984 | -24 | 2022-12-01 10:24:00 | 0.29 | 17548 | Israel | other_banner | 0.61 |
| 237 | C536391 | 21983 | -24 | 2022-12-01 10:24:00 | 0.29 | 17548 | Israel | other_banner | 0.61 |
# Canceled orders
df_C = df[df.order_id.str.startswith('C')]
# df_C = df[df['order_id'].str.contains('C', na = False)]
print(f'Number of data rows: {df_C.shape[0]}')
df_C.head(5)
Number of data rows: 9251
| order_id | vendor_code | quantity | invoice_date | unit_price | customer_id | country | channel | cost | |
|---|---|---|---|---|---|---|---|---|---|
| 141 | C536379 | D | -1 | 2022-12-01 09:41:00 | 27.50 | 14527 | USA | fb_ads | 0.18 |
| 154 | C536383 | 35004C | -1 | 2022-12-01 09:49:00 | 4.65 | 15311 | Germany | other_banner | 0.36 |
| 235 | C536391 | 22556 | -12 | 2022-12-01 10:24:00 | 1.65 | 17548 | Israel | other_banner | 0.61 |
| 236 | C536391 | 21984 | -24 | 2022-12-01 10:24:00 | 0.29 | 17548 | Israel | other_banner | 0.61 |
| 237 | C536391 | 21983 | -24 | 2022-12-01 10:24:00 | 0.29 | 17548 | Israel | other_banner | 0.61 |
Let's look at the data where the 'unit_price' is less than or equal to 0.
df_u=df[df['unit_price']<0]
print(f'Number of data rows: {df_u.shape[0]}')
df_u.head()
Number of data rows: 2
| order_id | vendor_code | quantity | invoice_date | unit_price | customer_id | country | channel | cost | |
|---|---|---|---|---|---|---|---|---|---|
| 299983 | A563186 | B | 1 | 2022-08-12 14:51:00 | -11062.06 | 12903 | USA | g_ads_s | 0.18 |
| 299984 | A563187 | B | 1 | 2022-08-12 14:52:00 | -11062.06 | 15617 | United Kingdom | g_ads_s | 0.18 |
df_u=df[df['unit_price']==0]
print(f'Number of data rows: {df_u.shape[0]}')
df_u.head()
Number of data rows: 2510
| order_id | vendor_code | quantity | invoice_date | unit_price | customer_id | country | channel | cost | |
|---|---|---|---|---|---|---|---|---|---|
| 622 | 536414 | 22139 | 56 | 2022-12-01 11:52:00 | 0.0 | 15409 | United Kingdom | g_ads_s | 1.03 |
| 1970 | 536545 | 21134 | 1 | 2022-12-01 14:32:00 | 0.0 | 18060 | Germany | organic | 0.00 |
| 1971 | 536546 | 22145 | 1 | 2022-12-01 14:33:00 | 0.0 | 12533 | USA | organic | 0.00 |
| 1972 | 536547 | 37509 | 1 | 2022-12-01 14:33:00 | 0.0 | 12016 | Czech Republic | organic | 0.00 |
| 1987 | 536549 | 85226A | 1 | 2022-12-01 14:34:00 | 0.0 | 17077 | United Kingdom | organic | 0.00 |
- We definitely need to remove rows where quantity is less than or equal to 0, and also rows where unit_price is less than 0.
- What should we do with the 2510 rows where unit_price equals 0?
We will treat products with unit_price = 0 as free (promotional) items.
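The cleaning rule above reduces to one boolean mask. A self-contained sketch on toy rows, keeping `unit_price == 0` as promotional items and flagging them for later analysis:

```python
import pandas as pd

# Toy rows modeled on the anomalies seen above (not the real dataset).
toy = pd.DataFrame({
    "order_id":   ["536365", "C536379", "A563186", "536414"],
    "quantity":   [6, -1, 1, 56],
    "unit_price": [2.55, 27.50, -11062.06, 0.0],
})
# Drop returns/cancellations (quantity <= 0) and negative prices;
# keep zero-priced rows as promotional freebies, flagged separately.
clean = toy[(toy["quantity"] > 0) & (toy["unit_price"] >= 0)].copy()
clean["is_promo"] = clean["unit_price"] == 0

print(clean["order_id"].tolist())  # ['536365', '536414']
```

The `is_promo` flag lets revenue metrics exclude freebies while retention metrics still count the interaction.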
# We calculate number of orders and client for each country
# where ['quantity']<=0 or ['unit_price']<0
df_cancel = df[(df['quantity']<=0) | (df['unit_price']<0)]
cancel_co_by_country = df_cancel.groupby('country').agg({
'customer_id': pd.Series.nunique,
'order_id': pd.Series.nunique
}).sort_values(by=['customer_id', 'order_id'], ascending=False)
# Add columns with percentage losses and round to 1 decimal place
cancel_co_by_country['customer_id_loss_pct'] = (cancel_co_by_country['customer_id'] / customers_and_orders_by_country['customer_id'] * 100).round(1)
cancel_co_by_country['order_id_loss_pct'] = (cancel_co_by_country['order_id'] / customers_and_orders_by_country['order_id'] * 100).round(1)
cancel_co_by_country
| customer_id | order_id | customer_id_loss_pct | order_id_loss_pct | |
|---|---|---|---|---|
| country | ||||
| USA | 777 | 1429 | 43.2 | 20.0 |
| United Kingdom | 670 | 1321 | 44.2 | 20.1 |
| Germany | 264 | 532 | 45.2 | 20.0 |
| Canada | 140 | 300 | 48.3 | 20.5 |
| France | 53 | 113 | 39.8 | 18.1 |
| Finland | 36 | 93 | 55.4 | 24.2 |
| Czech Republic | 31 | 67 | 63.3 | 23.3 |
| EIRE | 30 | 98 | 37.0 | 18.5 |
| Spain | 30 | 50 | 38.5 | 15.8 |
| Austria | 29 | 49 | 41.4 | 22.2 |
| Switzerland | 29 | 42 | 46.8 | 22.3 |
| Belgium | 27 | 58 | 38.6 | 17.7 |
| Greece | 26 | 65 | 43.3 | 21.7 |
| Lithuania | 26 | 49 | 51.0 | 12.9 |
| RSA | 26 | 36 | 45.6 | 19.3 |
| Israel | 25 | 48 | 43.1 | 21.2 |
| Netherlands | 24 | 50 | 41.4 | 16.3 |
| Singapore | 24 | 50 | 44.4 | 22.5 |
| Channel Islands | 24 | 42 | 50.0 | 20.4 |
| United Arab Emirates | 23 | 39 | 50.0 | 21.9 |
| Lebanon | 23 | 31 | 42.6 | 19.7 |
| Poland | 22 | 48 | 38.6 | 21.4 |
| Cyprus | 22 | 42 | 43.1 | 22.6 |
| Norway | 21 | 48 | 36.8 | 24.6 |
| Japan | 21 | 42 | 36.2 | 23.6 |
| Denmark | 20 | 48 | 35.7 | 17.9 |
| Sweden | 20 | 38 | 38.5 | 17.4 |
| Italy | 19 | 46 | 36.5 | 27.5 |
| Portugal | 19 | 33 | 37.3 | 15.8 |
| Hong Kong | 19 | 30 | 30.2 | 15.1 |
| Australia | 18 | 41 | 40.0 | 19.0 |
| Saudi Arabia | 18 | 34 | 47.4 | 27.2 |
| European Community | 17 | 31 | 53.1 | 28.4 |
| Bahrain | 17 | 22 | 37.0 | 14.7 |
| Iceland | 15 | 32 | 34.1 | 19.9 |
| Unspecified | 15 | 22 | 34.9 | 20.0 |
| Malta | 14 | 26 | 34.1 | 21.7 |
| Brazil | 11 | 29 | 25.0 | 17.9 |
def barh_plot(data, x1, x2, title, label1, label2, sortby = None, color1 = c1, color2 = c2, font_color = font_color, csfont = csfont):
data = data.sort_values(by=sortby, ascending=True)
fig, ax = plt.subplots(figsize=(8, 12))
bar_width = 0.6
bar1 = ax.barh(data.index, data[x1],
color=color1, alpha=1, height=bar_width, align='edge', label=label1)
bar2 = ax.barh(data.index, -data[x2],
color=color2, alpha=1, height=bar_width, align='edge', label=label2)
for i in ax.patches:
if i.get_width() > 0:
ax.text(i.get_width()+0.2, i.get_y() + bar_width / 2,
str(round((i.get_width()), 1))+'%',
fontsize=11, fontweight='bold',
color='grey', va='center')
else:
ax.text(i.get_width()+0.2, i.get_y() + bar_width / 2,
str(round((-i.get_width()), 1))+'%',
fontsize=11, fontweight='bold',
color='white', va='center')
ax.legend(loc='upper center')
ax.set_xlabel('')
ax.set_ylabel('')
ax.set_frame_on(False)
ax.tick_params(left=False, bottom=False, labelbottom=False)
    ax.set_yticks(np.arange(len(data.index)) + bar_width / 2)
    ax.set_yticklabels(data.index)
plt.yticks(fontsize=14)
plt.title(title, fontsize=20, color=font_color, **csfont)
plt.show()
barh_plot(cancel_co_by_country, x1 = 'customer_id_loss_pct', x2 = 'order_id_loss_pct',
title = 'Loss Percentage by Country', label1='Customer ID Loss %',
label2='Order ID Loss %', sortby='customer_id_loss_pct')
def level_bar(dfs, level1, level2, title, width=650, height=800, h=+1.05, font_color='#525252', csfont={'fontname':'Georgia'}):
fig = make_subplots(specs=[[{"secondary_y": False}, {"secondary_y": True}]],
horizontal_spacing=0,
shared_yaxes=False,
rows=1, cols=2, print_grid=True )
dfs.sort_values(by=level1, ascending=True, inplace=True)
x1 = dfs[level1]
y1 = dfs.index
fig.add_trace(go.Bar(orientation ='h',
x=x1, y=y1, name=level1, text=x1, textposition='inside',
textfont=dict(size=12),
marker_color=c1), 1, 1)
dfs.sort_values(by=level2, ascending=True, inplace=True)
x2 = dfs[level2]
y2 = dfs.index
fig.add_trace(go.Bar(orientation ='h',
x=x2, y=y2, name=level2, text=x2, textposition='inside',
textfont=dict(size=12),
marker_color=c2), 1, 2, secondary_y=True)
fig.update_layout(width=width, height=height,
xaxis_showticklabels=False,
title=title,
title_x = 0.5,
title_font=dict(size=20, color=font_color, family=csfont['fontname']),
xaxis_title='',
xaxis2_autorange='reversed',
xaxis2_showticklabels=False,
yaxis3_showticklabels=True,
legend=dict(x=0.45, y=+float(h), yanchor="top", xanchor="center", orientation="h"),
margin=dict(l=150, r=0, t=70, b=0))
fig.show()
level_bar(cancel_co_by_country, 'customer_id_loss_pct', 'order_id_loss_pct', 'Loss Percentage by Country')
This is the format of your plot grid: [ (1,1) x,y ] [ (1,2) x2,y2,y3 ]
Conclusions:
- Countries with the highest number of cancelled or returned orders may indicate issues related to delivery, product quality, or customer service levels in these regions.
- The percentage of losses can serve as an indicator of business process efficiency. A high percentage may indicate a need for improvements in order management processes, while a low percentage may indicate a high level of customer satisfaction.
- For a deeper analysis, other metrics that can help better understand customer behavior, such as total revenue per customer, average order value, or frequency of repeat purchases, can be considered.
# Filter df_basic to include only rows where unit_price is not 0
# Group the data by unit_price, vendor_code, and customer_id
# For each group, join unique values in other columns into a string
# For invoice_date, create a list of unique values and format them to minutes
# Reset the index of the resulting DataFrame
df_vendor = df_basic[df_basic['unit_price'] != 0].groupby(['unit_price', 'vendor_code', 'customer_id']) \
.agg({'order_id': lambda x: ', '.join(x.unique()),
'invoice_date': lambda x: list(x.dt.strftime('%Y-%m-%d %H:%M:%S').unique())}).reset_index()
# Split the string in order_id into a list of strings
df_vendor['order_id'] = df_vendor['order_id'].str.split(', ')
# Filter df_vendor to include only rows where order_id contains more than one item
df_multiple_orders = df_vendor[df_vendor['order_id'].apply(lambda x: len(x) > 1)]
# Filter df_multiple_orders to include only rows where order_id contains a string that starts or ends with 'C'
df_multiple_orders_C = df_multiple_orders[df_multiple_orders['order_id'].apply(lambda x: any(s.startswith('C') or s.endswith('C') for s in x))]
print(f'Number of data rows: {df_multiple_orders_C.shape[0]}')
# Sort df_multiple_orders_C by unit_price in descending order and display the first 5 rows
df_multiple_orders_C.sort_values(by='unit_price', ascending=False).head()
Number of data rows: 6883
| unit_price | vendor_code | customer_id | order_id | invoice_date | |
|---|---|---|---|---|---|
| 406845 | 8142.75 | POST | 16029 | [C551685, 551697] | [2022-05-03 12:51:00, 2022-05-03 13:46:00] |
| 406819 | 4161.06 | M | 12536 | [573077, C573079, 573080] | [2022-10-27 14:13:00, 2022-10-27 14:15:00, 202... |
| 406818 | 3949.32 | M | 12744 | [C571750, 571751] | [2022-10-19 11:16:00, 2022-10-19 11:18:00] |
| 406817 | 3155.95 | M | 15502 | [C562647, 569382] | [2022-08-08 12:56:00, 2022-10-03 16:44:00] |
| 406810 | 2382.92 | M | 12744 | [548813, C548830] | [2022-04-04 13:03:00, 2022-04-04 13:08:00] |
Let's display individual orders to compare.
order_ids = ['573077', 'C573079', '573080', 'C562647', '569382', 'C551685', '551697', '548813', 'C548830', 'C571750', '571751']
df_selected_orders = df_basic[df_basic['order_id'].isin(order_ids)]
df_selected_orders
| order_id | vendor_code | quantity | invoice_date | unit_price | customer_id | country | channel | cost | |
|---|---|---|---|---|---|---|---|---|---|
| 144825 | 548813 | M | 1 | 2022-04-04 13:03:00 | 162.60 | 12744 | Singapore | g_ads_b | 0.07 |
| 144826 | 548813 | M | 1 | 2022-04-04 13:03:00 | 2382.92 | 12744 | Singapore | g_ads_b | 0.07 |
| 144827 | 548813 | M | 1 | 2022-04-04 13:03:00 | 239.30 | 12744 | Singapore | g_ads_b | 0.07 |
| 144828 | 548813 | M | 1 | 2022-04-04 13:03:00 | 1252.95 | 12744 | Singapore | g_ads_b | 0.07 |
| 144830 | C548830 | M | -1 | 2022-04-04 13:08:00 | 162.60 | 12744 | Singapore | g_ads_b | 0.07 |
| 144831 | C548830 | M | -1 | 2022-04-04 13:08:00 | 2382.92 | 12744 | Singapore | g_ads_b | 0.07 |
| 144832 | C548830 | M | -1 | 2022-04-04 13:08:00 | 239.30 | 12744 | Singapore | g_ads_b | 0.07 |
| 144833 | C548830 | M | -1 | 2022-04-04 13:08:00 | 1252.95 | 12744 | Singapore | g_ads_b | 0.07 |
| 173277 | C551685 | POST | -1 | 2022-05-03 12:51:00 | 8142.75 | 16029 | Germany | organic | 0.00 |
| 173382 | 551697 | POST | 1 | 2022-05-03 13:46:00 | 8142.75 | 16029 | Germany | organic | 0.00 |
| 293842 | C562647 | M | -1 | 2022-08-08 12:56:00 | 3155.95 | 15502 | United Kingdom | fb_ads | 0.51 |
| 374542 | 569382 | M | 1 | 2022-10-03 16:44:00 | 3155.95 | 15502 | United Kingdom | g_ads_s | 2.83 |
| 406404 | C571750 | M | -1 | 2022-10-19 11:16:00 | 3949.32 | 12744 | Singapore | g_ads_b | 0.08 |
| 406405 | C571750 | M | -1 | 2022-10-19 11:16:00 | 2118.74 | 12744 | Singapore | g_ads_b | 0.08 |
| 406406 | 571751 | M | 1 | 2022-10-19 11:18:00 | 3949.32 | 12744 | Singapore | g_ads_b | 0.12 |
| 406407 | 571751 | M | 1 | 2022-10-19 11:18:00 | 2118.74 | 12744 | Singapore | g_ads_b | 0.12 |
| 422351 | 573077 | M | 1 | 2022-10-27 14:13:00 | 4161.06 | 12536 | France | other_banner | 0.64 |
| 422375 | C573079 | M | -2 | 2022-10-27 14:15:00 | 4161.06 | 12536 | France | other_banner | 0.64 |
| 422376 | 573080 | M | 1 | 2022-10-27 14:20:00 | 4161.06 | 12536 | France | other_banner | 0.64 |
What conclusions can be drawn?
Based on the timestamps and the customer acquisition channel, the paired orders fall into three groups:
- An order is placed and then cancelled.
- An order is cancelled and then placed again.
- An order is placed, cancelled, and then placed again.
These may be anomalies introduced when the data was collected. It is worth flagging the cancelled orders and their pairs for further investigation, along with all other orders whose order_id starts with "C" or that have 'quantity' <= 0 or 'unit_price' < 0.
We keep only those rows in which every order starting with 'C' has a later timestamp than all other orders in that row.
# The function checks each row in df_multiple_orders_C
# and leaves only those rows where all orders starting with 'C'
# have a later date than all other orders in that row.
def check_orders(row):
pairs = list(zip(row['order_id'], row['invoice_date']))
c_orders = [date for order, date in pairs if order.startswith('C')]
other_orders = [date for order, date in pairs if not order.startswith('C')]
return all(c_date > other_date for c_date in c_orders for other_date in other_orders)
df_multiple_orders_C_filtered = df_multiple_orders_C[df_multiple_orders_C.apply(check_orders, axis=1)]
df_multiple_orders_C_filtered.sort_values(by='unit_price', ascending=False).head()
| unit_price | vendor_code | customer_id | order_id | invoice_date | |
|---|---|---|---|---|---|
| 406810 | 2382.92 | M | 12744 | [548813, C548830] | [2022-04-04 13:03:00, 2022-04-04 13:08:00] |
| 406804 | 2053.07 | M | 12744 | [548820, C548834] | [2022-04-04 13:04:00, 2022-04-04 13:09:00] |
| 406775 | 1252.95 | M | 12744 | [548813, C548830] | [2022-04-04 13:03:00, 2022-04-04 13:08:00] |
| 406766 | 1136.30 | M | 12669 | [548913, C548914] | [2022-04-05 09:45:00, 2022-04-05 09:46:00] |
| 406644 | 549.34 | M | 12473 | [575632, C575635] | [2022-11-10 13:44:00, 2022-11-10 13:45:00] |
# Get a list of all order_ids from df_multiple_orders_C_filtered
orders_to_remove = df_multiple_orders_C_filtered['order_id'].explode().unique()
print(f'Number of data rows removed: {orders_to_remove.shape[0]}')
# Remove these orders from df_basic
df_basic_filtered = df_basic[~df_basic['order_id'].isin(orders_to_remove)]
Number of data rows removed: 5166
Remaining canceled orders after filtering
# Canceled orders
df_basic_filtered_C = df_basic_filtered[df_basic_filtered.order_id.str.startswith('C')]
print(f'Number of data rows: {df_basic_filtered_C.shape[0]}')
df_basic_filtered_C.sort_values(by='customer_id', ascending=False).head(5)
Number of data rows: 2902
| order_id | vendor_code | quantity | invoice_date | unit_price | customer_id | country | channel | cost | |
|---|---|---|---|---|---|---|---|---|---|
| 349688 | C567518 | M | -1 | 2022-09-20 16:28:00 | 5.95 | 18308 | EIRE | organic | 0.00 |
| 349690 | C567518 | 21871 | -12 | 2022-09-20 16:28:00 | 1.65 | 18308 | EIRE | organic | 0.00 |
| 349689 | C567518 | 22846 | -1 | 2022-09-20 16:28:00 | 16.95 | 18308 | EIRE | organic | 0.00 |
| 96711 | C544583 | S | -1 | 2022-02-21 14:48:00 | 31.98 | 18298 | Finland | g_ads_b | 0.07 |
| 96710 | C544583 | S | -1 | 2022-02-21 14:48:00 | 37.49 | 18298 | Finland | g_ads_b | 0.07 |
Let's delete the data where 'quantity'<=0 or 'unit_price'<0
df_basic_filtered_remove = df_basic_filtered[((df_basic_filtered['quantity']<=0) \
| (df_basic_filtered['unit_price']<0))]
df_basic_filtered = df_basic_filtered[~((df_basic_filtered['quantity']<=0) \
| (df_basic_filtered['unit_price']<0))]
print(f'Number of data rows removed: {df_basic_filtered_remove.shape[0]}')
print(f'Number of data rows df: {df_basic_filtered.shape[0]}')
Number of data rows removed: 4240
Number of data rows df: 444942
"""
Making a copy of DF.
And we assign "df_basic_filtered" as the main data frame DF.
"""
df_copy_filtered = df_basic_filtered.copy(deep=True)
df = df_basic_filtered.copy(deep=True)
Create a revenue column, 'revenue'. Next, we group the data by order.
df['revenue'] = df['quantity'] * df['unit_price']
It will be interesting to see the maximum values. Perhaps these are errors.
max_quantity_rows = df.nlargest(5, 'quantity')
max_unit_price_rows = df.nlargest(5, 'unit_price')
max_quantity_rows
| order_id | vendor_code | quantity | invoice_date | unit_price | customer_id | country | channel | cost | revenue | |
|---|---|---|---|---|---|---|---|---|---|---|
| 502122 | 578841 | 84826 | 12540 | 2022-11-25 15:57:00 | 0.00 | 13256 | Japan | g_ads_s | 2.59 | 0.0 |
| 74614 | 542504 | 37413 | 5568 | 2022-01-28 12:03:00 | 0.00 | 12020 | United Kingdom | organic | 0.00 | 0.0 |
| 421632 | 573008 | 84077 | 4800 | 2022-10-27 12:26:00 | 0.21 | 12901 | Canada | other_banner | 0.51 | 1008.0 |
| 206121 | 554868 | 22197 | 4300 | 2022-05-27 10:52:00 | 0.72 | 13135 | USA | organic | 0.00 | 3096.0 |
| 220843 | 556231 | 85123A | 4000 | 2022-06-09 15:04:00 | 0.00 | 16028 | Italy | g_ads_s | 2.89 | 0.0 |
max_unit_price_rows
| order_id | vendor_code | quantity | invoice_date | unit_price | customer_id | country | channel | cost | revenue | |
|---|---|---|---|---|---|---|---|---|---|---|
| 15017 | 537632 | AMAZONFEE | 1 | 2022-12-07 15:08:00 | 13541.33 | 13179 | Lebanon | organic | 0.00 | 13541.33 |
| 299982 | A563185 | B | 1 | 2022-08-12 14:50:00 | 11062.06 | 16691 | USA | g_ads_s | 0.18 | 11062.06 |
| 173382 | 551697 | POST | 1 | 2022-05-03 13:46:00 | 8142.75 | 16029 | Germany | organic | 0.00 | 8142.75 |
| 297723 | 562955 | DOT | 1 | 2022-08-11 10:14:00 | 4505.17 | 16300 | USA | organic | 0.00 | 4505.17 |
| 268028 | 560373 | M | 1 | 2022-07-18 12:30:00 | 4287.63 | 14695 | USA | fb_ads | 0.91 | 4287.63 |
The results are questionable. These orders must be looked at separately.
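One way to act on this is to isolate the service-like lines before grouping. A minimal sketch (the helper name `flag_service_rows` and the `SERVICE_CODES` list are our assumptions, inferred from the vendor codes visible above, not from a data dictionary):

```python
import pandas as pd

# Hypothetical list of service-like vendor codes seen in the tables above
# (assumption): fees, postage, manual adjustments, bank charges, samples.
SERVICE_CODES = {'AMAZONFEE', 'POST', 'DOT', 'M', 'B', 'S'}

def flag_service_rows(df, code_col='vendor_code'):
    """Return a boolean mask for rows whose vendor code is service-like."""
    return df[code_col].astype(str).str.upper().isin(SERVICE_CODES)

# Minimal demo on synthetic rows
demo = pd.DataFrame({'vendor_code': ['AMAZONFEE', '85123A', 'POST', '22197'],
                     'unit_price': [13541.33, 2.95, 8142.75, 0.72]})
print(demo[flag_service_rows(demo)])
```

Rows flagged this way can then be reviewed or excluded separately from genuine product sales.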
df_grouped = df.groupby('order_id').agg({
'quantity': 'sum',
'invoice_date': 'first',
'customer_id': 'first',
'country': 'first',
'channel': 'first',
'cost': 'first',
'revenue': 'sum'
}).reset_index()
print(f'Number of data rows: {df_grouped.shape[0]}')
Number of data rows: 17796
"""
Making a copy of DF.
And we assign DF_GROUPED as the main data frame DF.
"""
df_copy = df.copy(deep=True)
df = df_grouped.copy(deep=True)
df_grouped.describe(percentiles=[.01, .05, .95, .99]).style.background_gradient(cmap='YlGnBu')
| quantity | cost | revenue | |
|---|---|---|---|
| count | 17796.000000 | 17796.000000 | 17796.000000 |
| mean | 249.421106 | 0.531669 | 479.720345 |
| std | 530.478250 | 0.766268 | 1180.294436 |
| min | 1.000000 | 0.000000 | 0.000000 |
| 1% | 1.000000 | 0.000000 | 0.000000 |
| 5% | 4.000000 | 0.000000 | 3.250000 |
| 50% | 134.000000 | 0.140000 | 270.000000 |
| 95% | 750.000000 | 2.442500 | 1503.312500 |
| 99% | 2112.450000 | 3.290000 | 4702.741000 |
| max | 14730.000000 | 3.470000 | 52940.940000 |
# Create a DataFrame with dummy variables for each channel
channel_dummies = df['channel'].str.get_dummies(sep=', ')
channel_dummies['order_id'] = df['order_id']
channel_dummies['customer_id'] = df['customer_id']
channel_dummies['cost'] = df['cost']
channel_dummies['revenue_d'] = df['revenue']
channel_counts = channel_dummies.groupby(['customer_id', 'order_id']).sum()
# Add columns with sums of cost and revenue for each channel
for channel in channel_dummies.columns:
if channel not in ['order_id', 'customer_id', 'cost', 'revenue_d']:
channel_counts[channel+'_cost'] = channel_counts[channel] \
* channel_counts['cost']
channel_counts[channel+'_revenue'] = channel_counts[channel] \
* channel_counts['revenue_d']
# Sum the number of orders for each 'customer_id'
channel_counts = channel_counts.groupby('customer_id').sum()
# channel_counts = channel_counts.reset_index()
# Grouping by customer
df_customers = df.groupby('customer_id').agg({
'revenue': 'sum',
'country': 'first',
'order_id': 'count',
'cost': 'sum',
'invoice_date': ['min', 'max']
})
df_customers.columns = ['revenue', 'country', 'order_count', 'total_cost', 'first_date_invoice', 'last_date_invoice']
df_customers = df_customers.sort_values('revenue', ascending=False).reset_index()
# Adding data by channels
df_customers = df_customers.join(channel_counts, on='customer_id')
df_customers.drop(columns='cost', inplace=True)
df_customers.head(2)
| customer_id | revenue | country | order_count | total_cost | first_date_invoice | last_date_invoice | fb_ads | g_ads_b | g_ads_s | ... | fb_ads_cost | fb_ads_revenue | g_ads_b_cost | g_ads_b_revenue | g_ads_s_cost | g_ads_s_revenue | organic_cost | organic_revenue | other_banner_cost | other_banner_revenue | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14646 | 280206.02 | Netherlands | 74 | 38.45 | 2022-01-14 12:24:00 | 2022-12-21 15:05:00 | 18 | 14 | 6 | ... | 18.19 | 56300.41 | 1.20 | 52855.32 | 11.97 | 19601.70 | 0.0 | 95733.13 | 7.09 | 55715.46 |
| 1 | 18102 | 259657.30 | Germany | 60 | 61.85 | 2022-02-07 15:34:00 | 2022-12-09 13:44:00 | 19 | 10 | 15 | ... | 18.00 | 98328.94 | 0.95 | 43039.27 | 38.27 | 39960.98 | 0.0 | 45071.41 | 4.63 | 33256.70 |
2 rows × 23 columns
df_customers[['revenue', 'order_count', 'total_cost']].describe(percentiles=[.01, .05, .95, .99]).style.background_gradient(cmap='YlGnBu')
| revenue | order_count | total_cost | |
|---|---|---|---|
| count | 5416.000000 | 5416.000000 | 5416.000000 |
| mean | 1576.274606 | 3.285820 | 1.746970 |
| std | 6740.199050 | 5.753437 | 3.405160 |
| min | 0.000000 | 1.000000 | 0.000000 |
| 1% | 0.000000 | 1.000000 | 0.000000 |
| 5% | 0.000000 | 1.000000 | 0.000000 |
| 50% | 547.030000 | 2.000000 | 0.810000 |
| 95% | 5023.432500 | 10.000000 | 6.062500 |
| 99% | 13492.269500 | 22.850000 | 13.471000 |
| max | 280206.020000 | 198.000000 | 96.730000 |
Checksum verification between datasets
total_revenue_customers = df_customers['revenue'].sum()
total_revenue_df = df['revenue'].sum()
print(f"Sum of all revenue in df_customers >>>> {total_revenue_customers}")
print(f"Sum of all revenue in df >>>> {total_revenue_df}")
Sum of all revenue in df_customers >>>> 8537103.264
Sum of all revenue in df >>>> 8537103.264
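Rather than comparing the printed sums by eye, the checksum can be asserted programmatically with a floating-point tolerance. A small sketch (the helper name `check_revenue_totals` is ours):

```python
import numpy as np

def check_revenue_totals(total_a, total_b, rel_tol=1e-9):
    """Raise if two revenue totals diverge beyond floating-point noise."""
    if not np.isclose(total_a, total_b, rtol=rel_tol):
        raise ValueError(f'Revenue mismatch: {total_a} != {total_b}')
    return True

# With the sums printed above, the check passes silently
check_revenue_totals(8537103.264, 8537103.264)
```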
We should look at the rows where revenue is 0,
as well as the orders with the highest revenue.
df_grouped_nonpaying=df_grouped[df_grouped['revenue']==0]
print(f'Number of data rows revenue by orders = 0 >>>>> {df_grouped_nonpaying.shape[0]}')
df_grouped_nonpaying.head()
Number of data rows revenue by orders = 0 >>>>> 766
| order_id | quantity | invoice_date | customer_id | country | channel | cost | revenue | |
|---|---|---|---|---|---|---|---|---|
| 41 | 536414 | 56 | 2022-12-01 11:52:00 | 15409 | United Kingdom | g_ads_s | 1.03 | 0.0 |
| 80 | 536545 | 1 | 2022-12-01 14:32:00 | 18060 | Germany | organic | 0.00 | 0.0 |
| 81 | 536546 | 1 | 2022-12-01 14:33:00 | 12533 | USA | organic | 0.00 | 0.0 |
| 82 | 536547 | 1 | 2022-12-01 14:33:00 | 12016 | Czech Republic | organic | 0.00 | 0.0 |
| 83 | 536549 | 1 | 2022-12-01 14:34:00 | 17077 | United Kingdom | organic | 0.00 | 0.0 |
max_revenue_rows = df.nlargest(10, 'revenue')
max_revenue_rows.head(5)
| order_id | quantity | invoice_date | customer_id | country | channel | cost | revenue | |
|---|---|---|---|---|---|---|---|---|
| 14895 | 574941 | 14149 | 2022-11-07 17:42:00 | 15401 | Spain | organic | 0.00 | 52940.94 |
| 15492 | 576365 | 13956 | 2022-11-14 17:55:00 | 15305 | USA | g_ads_b | 0.05 | 50653.91 |
| 7618 | 556444 | 60 | 2022-06-10 15:28:00 | 15098 | Germany | organic | 0.00 | 38970.00 |
| 13773 | 572209 | 1920 | 2022-10-21 12:08:00 | 18102 | Germany | fb_ads | 1.31 | 22206.00 |
| 11868 | 567381 | 6760 | 2022-09-20 10:12:00 | 17450 | USA | g_ads_b | 0.14 | 22104.80 |
df_customers_nonpaying = df_customers[df_customers['revenue']==0]
print(f'Number of data rows revenue by customer = 0 >>>>> {df_customers_nonpaying.shape[0]}')
df_customers_nonpaying.head()
Number of data rows revenue by customer = 0 >>>>> 299
| customer_id | revenue | country | order_count | total_cost | first_date_invoice | last_date_invoice | fb_ads | g_ads_b | g_ads_s | ... | fb_ads_cost | fb_ads_revenue | g_ads_b_cost | g_ads_b_revenue | g_ads_s_cost | g_ads_s_revenue | organic_cost | organic_revenue | other_banner_cost | other_banner_revenue | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5117 | 18098 | 0.0 | Austria | 1 | 1.72 | 2022-10-06 17:45:00 | 2022-10-06 17:45:00 | 0 | 0 | 1 | ... | 0.00 | 0.0 | 0.00 | 0.0 | 1.72 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5118 | 18197 | 0.0 | USA | 1 | 0.10 | 2022-06-17 09:34:00 | 2022-06-17 09:34:00 | 0 | 1 | 0 | ... | 0.00 | 0.0 | 0.10 | 0.0 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5119 | 18195 | 0.0 | Germany | 1 | 2.85 | 2022-05-03 13:36:00 | 2022-05-03 13:36:00 | 0 | 0 | 1 | ... | 0.00 | 0.0 | 0.00 | 0.0 | 2.85 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5120 | 18275 | 0.0 | Australia | 1 | 0.00 | 2022-04-08 11:29:00 | 2022-04-08 11:29:00 | 0 | 0 | 0 | ... | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5121 | 18303 | 0.0 | USA | 3 | 3.24 | 2022-03-28 15:19:00 | 2022-06-15 16:56:00 | 1 | 1 | 1 | ... | 1.53 | 0.0 | 0.04 | 0.0 | 1.67 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 23 columns
def outliers(df, column='revenue'):
df_all = df[df[column] > 0]
# Confidence interval for the 'REVENUE' value
print(f"Confidence interval for the '{column}' value before{sms.DescrStatsW(df_all[column]).tconfint_mean()}")
quartile_99 = df_all[column].quantile(0.99)
df_all_99 = df_all[df_all[column] < quartile_99]
df_all_1 = df_all[df_all[column] >= quartile_99]
print(f"Confidence interval for the '{column}' value \nafter removing outliers{sms.DescrStatsW(df_all_99[column]).tconfint_mean()}")
print(f"Quartile_99 >>> {round(quartile_99, 4)}")
return df_all, df_all_99, df_all_1
df_all, df_all_99, df_all_1 = outliers(df)
df_all_customers, df_all_99_customers, df_all_1_customers = outliers(df_customers)
Confidence interval for the 'revenue' value before(483.2429294509636, 519.3528737199114)
Confidence interval for the 'revenue' value
after removing outliers(406.76909749649883, 422.9835775138814)
Quartile_99 >>> 4869.1813
Confidence interval for the 'revenue' value before(1478.6424091674514, 1858.1186867872104)
Confidence interval for the 'revenue' value
after removing outliers(1181.2840420200062, 1275.829795689767)
Quartile_99 >>> 13920.932
def plot_outliers(df_all, df_all_99, x='order_id', column='revenue'):
quartile1 = df_all[column].quantile(0.25)
quartile3 = df_all[column].quantile(0.75)
interquantile_range = quartile3 - quartile1
min_y = quartile3 + 1.5*interquantile_range
max_val = df_all[column].max()
max_y = max_val + (max_val*0.1)
plt.figure(figsize=(10,4))
plt.suptitle(f"Distribution outliers ({column}/{x}) plot before and after removal",
fontsize=20, color=font_color, **csfont)
plt.subplot(121)
sns.scatterplot(data=df_all, x=x, y=column)
plt.ylim(min_y, max_y)
ax = plt.gca()
ax.get_xaxis().set_visible(False)
plt.subplot(122)
sns.scatterplot(data=df_all_99, x=x, y=column)
plt.ylim(min_y, max_y)
ax = plt.gca()
ax.get_xaxis().set_visible(False)
plt.show()
plot_outliers(df_all, df_all_99)
plot_outliers(df_all_customers, df_all_99_customers, x='customer_id')
sum_df_all_1_customers = df_all_1_customers['revenue'].sum()
sum_df_all_99_customers = df_all_99_customers['revenue'].sum()
sum_df_all_1 = df_all_1['revenue'].sum()
sum_df_all_99 = df_all_99['revenue'].sum()
# Calculate the percentage
percentage_customers = (sum_df_all_1_customers / sum_df_all_99_customers) * 100
percentage = (sum_df_all_1 / sum_df_all_99) * 100
# Create a dataframe with these values
df_customers_pie = pd.DataFrame({'Sum': [sum_df_all_99_customers, sum_df_all_1_customers], 'Percentage': [100, percentage_customers]},
index=['sum_df_all_99_customers', 'sum_df_all_1_customers'])
df_pie = pd.DataFrame({'Sum': [sum_df_all_99, sum_df_all_1], 'Percentage': [100, percentage]},
index=['sum_df_all_99', 'sum_df_all_1'])
# Plot a pie chart with these values
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
patches1, texts1, autotexts1 = axs[0].pie(df_pie['Percentage'],
explode=[0, 0.1], pctdistance=0.6, colors=[c1, c2],
autopct=lambda p: '{:.1f}%\n({:.0f})'.format(p, p * sum(df_pie['Sum']) / 100))
axs[0].set_title('by Orders', pad=5, fontsize=14, color=font_color, **csfont)
patches2, texts2, autotexts2 = axs[1].pie(df_customers_pie['Percentage'],
explode=[0, 0.1], pctdistance=0.6, colors=[c1, c2],
autopct=lambda p: '{:.1f}%\n({:.0f})'.format(p, p * sum(df_customers_pie['Sum']) / 100))
axs[1].set_title('by Customers', pad=5, fontsize=14, color=font_color, **csfont)
ti = fig.suptitle('Percentage of revenue amount 1% (data outliers)', fontsize=18, color=font_color, **csfont)
ti.set_position([0.55, 1.03])
# Customize text labels
for autotext in autotexts1+autotexts2:
autotext.set_color('white')
autotext.set_fontsize(14)
autotext.set_horizontalalignment('center')
autotext.set_verticalalignment('center')
# Add legend
labels = ['sum 99%', 'sum 1%']
fig.legend([patches1[0], patches1[1]],
labels=labels,
loc="upper center",
borderaxespad=3,
title="Total Revenue")
plt.show()
- There are clients and orders with zero revenue even though funds were spent on acquiring them. Such data can be deleted.
- There are highly profitable orders and customers: the top 1% of orders accounts for 18.1% of total revenue, and the top 1% of customers brought in 27.1% of total income. Such clients cannot be excluded; they must be considered individually.
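The 1% shares quoted above can be computed directly from a revenue series. A sketch of the calculation (the helper name `top_share` is ours; the cutoff mirrors the `quantile(0.99)` split used in the `outliers` function):

```python
import pandas as pd

def top_share(series, q=0.99):
    """Share of the total contributed by values at or above the q-th quantile."""
    s = series[series > 0]
    cutoff = s.quantile(q)
    return s[s >= cutoff].sum() / s.sum()

# Synthetic illustration: one large order dominates a skewed distribution
revenue = pd.Series([10] * 99 + [1000])
print(round(top_share(revenue), 3))  # 0.503: the top 1% contributes about half
```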
paying_customers = df_customers[df_customers['revenue'] > 0]
country_data = paying_customers.groupby('country').agg({'customer_id': 'count', 'order_count': 'sum', 'revenue': 'sum'})
country_data.rename(columns={'customer_id': 'p_customers', 'order_count': 'num_orders', 'revenue': 'total_revenue'}, inplace=True)
country_data['cus_per'] = round((country_data['p_customers'] / country_data['p_customers'].sum()) * 100, 2)
country_data['rev_per'] = round((country_data['total_revenue'] / country_data['total_revenue'].sum()) * 100, 2)
country_data['customer_quality'] = round(country_data['rev_per'] / country_data['cus_per'], 2)
country_data.sort_values('total_revenue', ascending=False, inplace=True)
country_data
| p_customers | num_orders | total_revenue | cus_per | rev_per | customer_quality | |
|---|---|---|---|---|---|---|
| country | ||||||
| USA | 1530 | 4831 | 2175922.781 | 29.90 | 25.49 | 0.85 |
| United Kingdom | 1261 | 4451 | 1996822.431 | 24.64 | 23.39 | 0.95 |
| Germany | 498 | 1770 | 1021030.100 | 9.73 | 11.96 | 1.23 |
| Canada | 236 | 976 | 428705.481 | 4.61 | 5.02 | 1.09 |
| Netherlands | 51 | 227 | 331192.630 | 1.00 | 3.88 | 3.88 |
| EIRE | 52 | 327 | 232400.340 | 1.02 | 2.72 | 2.67 |
| France | 116 | 443 | 215893.650 | 2.27 | 2.53 | 1.11 |
| Spain | 69 | 228 | 166548.720 | 1.35 | 1.95 | 1.44 |
| Australia | 37 | 147 | 134453.020 | 0.72 | 1.57 | 2.18 |
| Lithuania | 44 | 299 | 110253.810 | 0.86 | 1.29 | 1.50 |
| Greece | 49 | 192 | 106252.970 | 0.96 | 1.24 | 1.29 |
| Belgium | 62 | 228 | 100144.630 | 1.21 | 1.17 | 0.97 |
| Singapore | 45 | 142 | 98346.800 | 0.88 | 1.15 | 1.31 |
| Denmark | 48 | 191 | 93655.180 | 0.94 | 1.10 | 1.17 |
| Sweden | 46 | 164 | 89937.170 | 0.90 | 1.05 | 1.17 |
| Czech Republic | 37 | 185 | 85846.730 | 0.72 | 1.01 | 1.40 |
| Switzerland | 47 | 105 | 81053.950 | 0.92 | 0.95 | 1.03 |
| United Arab Emirates | 39 | 122 | 75358.520 | 0.76 | 0.88 | 1.16 |
| Finland | 53 | 228 | 74129.690 | 1.04 | 0.87 | 0.84 |
| Israel | 46 | 142 | 69874.950 | 0.90 | 0.82 | 0.91 |
| Portugal | 44 | 155 | 68336.710 | 0.86 | 0.80 | 0.93 |
| Channel Islands | 45 | 134 | 66768.020 | 0.88 | 0.78 | 0.89 |
| RSA | 46 | 132 | 64790.780 | 0.90 | 0.76 | 0.84 |
| Austria | 51 | 142 | 62269.750 | 1.00 | 0.73 | 0.73 |
| Japan | 51 | 109 | 62154.060 | 1.00 | 0.73 | 0.73 |
| Norway | 48 | 120 | 61192.820 | 0.94 | 0.72 | 0.77 |
| Brazil | 41 | 126 | 56698.040 | 0.80 | 0.66 | 0.82 |
| Poland | 47 | 140 | 52175.960 | 0.92 | 0.61 | 0.66 |
| Hong Kong | 49 | 153 | 48177.280 | 0.96 | 0.56 | 0.58 |
| Lebanon | 45 | 100 | 46930.600 | 0.88 | 0.55 | 0.62 |
| Cyprus | 43 | 114 | 44943.640 | 0.84 | 0.53 | 0.63 |
| Bahrain | 41 | 115 | 41138.190 | 0.80 | 0.48 | 0.60 |
| Italy | 42 | 91 | 36300.870 | 0.82 | 0.43 | 0.52 |
| Iceland | 36 | 114 | 34148.500 | 0.70 | 0.40 | 0.57 |
| European Community | 23 | 58 | 29932.640 | 0.45 | 0.35 | 0.78 |
| Malta | 35 | 77 | 26977.011 | 0.68 | 0.32 | 0.47 |
| Unspecified | 36 | 73 | 25926.940 | 0.70 | 0.30 | 0.43 |
| Saudi Arabia | 28 | 74 | 20417.900 | 0.55 | 0.24 | 0.44 |
"""Unattractive graph due to large discrepancies between indicators"""
# barh_plot(country_data, x1 = 'rev_per', x2 = 'cus_per',
# title = 'Percentage of Total', label1='Revenue %',
# label2='Customers %', sortby='rev_per')
level_bar(country_data, 'rev_per', 'customer_quality', 'Percentage of Revenue and Customer Quality')
From the data, we can draw the following conclusions:
- The USA is the country with the highest number of paying customers (1530), accounting for 29.9% of the total number of paying customers. This country also brings in the highest revenue - 2175922.781, which is 25.49% of the total revenue.
- The second-largest market is the United Kingdom with 1261 paying customers (24.64% of the total) and a revenue of 1996822.431 (23.39% of the total revenue).
- Germany ranks third with 498 paying customers (9.73% of the total) and a revenue of 1021030.100 (11.96% of the total revenue).
- Despite the Netherlands having only 1% of the total number of paying customers (51 customers), they contribute 3.88% of the total revenue, indicating a high average order value in this country.
- Saudi Arabia has the lowest performance, accounting for just 0.55% of total paying customers (28 customers), contributing 0.24% of total revenue, which also indicates a relatively low average order value.
Preliminarily, based on relative customer quality (the customer_quality column), we can identify promising sales regions.
channels = ['fb_ads', 'g_ads_b', 'g_ads_s', 'organic', 'other_banner']
channel_data = pd.DataFrame()
# Calculate total number of paying customers across all channels
total_paying_customers = sum((paying_customers[channel+'_revenue'] > 0).sum() for channel in channels)
# Calculate statistics for each channel
for channel in channels:
num_paying_customers = (paying_customers[channel+'_revenue'] > 0).sum()
customer_percentage = (num_paying_customers / total_paying_customers) * 100
customer_percentage_unq = (num_paying_customers / paying_customers.shape[0]) * 100
num_orders = paying_customers[paying_customers[channel+'_revenue'] > 0]['order_count'].sum()
total_revenue = paying_customers[channel+'_revenue'].sum()
total_cost = paying_customers[channel+'_cost'].sum()
revenue_percentage = (total_revenue / paying_customers['revenue'].sum()) * 100
    # DataFrame.append was removed in pandas 2.0; build the row and concat instead
    channel_data = pd.concat([channel_data, pd.DataFrame([{
        'channel': channel,
        'p_customers': num_paying_customers,
        'customer_per': customer_percentage,
        'customer_per_unq': customer_percentage_unq,
        'num_orders': num_orders,
        'total_revenue': total_revenue,
        'total_cost': total_cost,
        'revenue_per': revenue_percentage
    }])], ignore_index=True)
channel_data.set_index('channel', inplace=True)
channel_data.sort_values('total_revenue', ascending=False, inplace=True)
channel_data
| p_customers | customer_per | customer_per_unq | num_orders | total_revenue | total_cost | revenue_per | |
|---|---|---|---|---|---|---|---|
| channel | |||||||
| organic | 2777 | 27.626343 | 54.270080 | 13276 | 2705519.191 | 0.00 | 31.691302 |
| fb_ads | 2022 | 20.115400 | 39.515341 | 11210 | 1752351.471 | 2965.15 | 20.526301 |
| g_ads_b | 1839 | 18.294867 | 35.939027 | 10587 | 1587409.852 | 261.04 | 18.594244 |
| other_banner | 1779 | 17.697971 | 34.766465 | 10539 | 1296409.400 | 1332.60 | 15.185589 |
| g_ads_s | 1635 | 16.265420 | 31.952316 | 9916 | 1195413.350 | 4694.11 | 14.002564 |
df.groupby('channel') \
    .agg({'cost': 'mean'}) \
    .sort_values('cost', ascending=False) \
    .rename(columns={'cost': 'mean_cost'})
| mean_cost | |
|---|---|
| channel | |
| g_ads_s | 1.822495 |
| fb_ads | 0.843417 |
| other_banner | 0.462231 |
| g_ads_b | 0.086499 |
| organic | 0.000000 |
fig, ax = plt.subplots()
total_cost_list = channel_data['total_cost'].tolist()
wedges, texts, autotexts = ax.pie(channel_data['revenue_per'],
autopct='%1.1f%%',
colors=[c2 if x != max(total_cost_list) else c1 for x in total_cost_list],
textprops=dict(color="w"), explode=[0, 0, 0, 0, 0.1],
pctdistance=0.7, labeldistance=0.8)
# Draw a white circle at the center (for a "donut" chart)
centre_circle = plt.Circle((0,0),0.40,fc='white')
fig.gca().add_artist(centre_circle)
title = plt.title('Percentage of revenue by acquisition channels and (advertising costs)',
pad=10, fontsize=16, color=font_color, **csfont)
ax.axis('equal')
# Annotate each sector with the corresponding channel and total_cost
for i, p in enumerate(wedges):
ang = (p.theta2 - p.theta1)/2. + p.theta1
y = np.sin(np.deg2rad(ang))
x = np.cos(np.deg2rad(ang))
horizontalalignment = {-1: "right", 1: "left"}[int(np.sign(x))]
connectionstyle = "angle,angleA=0,angleB={}".format(ang)
ax.annotate(f"{channel_data.index[i]} ({total_cost_list[i]:.1f})", xy=(x/0.8, y/1.1),
horizontalalignment=horizontalalignment, fontsize=12, **csfont)
plt.setp(autotexts, size=18, weight="bold", **hfont)
plt.show()
Results:
- Organic traffic turns out to be the most effective customer acquisition channel, attracting the largest number of paying customers (2,777, or 27.6% of the total) and generating the highest revenue - 2,705,519.19, or 31.7% of total revenue. It's important to note that no money was spent on acquiring these customers.
- Facebook Ads (fb_ads) is the second most effective, attracting 2,022 paying customers (20.1% of the total) and generating 1,752,351.47 in revenue (20.5% of total revenue). 2,965.15 was spent on attracting these clients. Average acquisition cost per purchase: 0.84.
- Google Ads Media (g_ads_b) ranks third, attracting 1,839 paying customers (18.3% of the total) and generating 1,587,409.85 in revenue (18.6% of total revenue). Only 261.04 was spent on attracting these clients - almost free. Average acquisition cost per purchase: 0.086.
- Other banners (other_banner) attracted 1,779 paying customers (17.7% of the total) and generated 1,296,409.40 in revenue (15.2% of total revenue). 1,332.60 was spent on attracting these clients. Average acquisition cost per purchase: 0.46.
- Google Ads in search results (g_ads_s) was the least effective, attracting 1,635 paying customers (16.3% of the total) and generating 1,195,413.35 in revenue (14.0% of total revenue). The largest amount was spent on attracting these clients - 4,694.11. Average acquisition cost per purchase: 1.82.
Overall, organic traffic is the most effective acquisition channel: it generates the most revenue at zero cost.
Google Ads in search results (g_ads_s), by contrast, proved the least effective despite having the highest costs.
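To make the cost-effectiveness comparison explicit, a crude return on marketing investment (ROMI) can be computed per channel. A sketch using totals copied from the channel_data table above (`organic` has zero cost, so its ROMI is left undefined rather than infinite):

```python
import pandas as pd

# Totals copied from the channel_data table above
data = pd.DataFrame({
    'total_revenue': [2705519.19, 1752351.47, 1587409.85, 1296409.40, 1195413.35],
    'total_cost':    [0.00,       2965.15,    261.04,     1332.60,    4694.11],
}, index=['organic', 'fb_ads', 'g_ads_b', 'other_banner', 'g_ads_s'])

# ROMI = (revenue - cost) / cost; undefined where cost is zero
data['romi'] = (data['total_revenue'] - data['total_cost']) / data['total_cost']
data.loc[data['total_cost'] == 0, 'romi'] = float('nan')
print(data.sort_values('romi', ascending=False))
```

By this measure, g_ads_b is by far the best paid channel, consistent with the conclusions above.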
df['month'] = df['invoice_date'].dt.to_period('M')
monthly_data = df.groupby('month').agg({
'cost': 'sum',
'revenue': 'sum',
'customer_id': 'nunique',
'order_id': 'count'
})
monthly_data.sort_index(inplace=True)
monthly_data.columns = ['adv_cost', 'total_revenue', 'unique_customers', 'total_orders']
monthly_data['arpu'] = monthly_data['total_revenue'] / monthly_data['total_orders']  # revenue per order
monthly_data['arppu'] = monthly_data['total_revenue'] / monthly_data['unique_customers']  # revenue per unique customer
monthly_data
| adv_cost | total_revenue | unique_customers | total_orders | arpu | arppu | |
|---|---|---|---|---|---|---|
| month | ||||||
| 2022-01 | 476.54 | 459389.720 | 733 | 915 | 502.065268 | 626.725402 |
| 2022-02 | 495.94 | 390192.680 | 733 | 899 | 434.029677 | 532.322892 |
| 2022-03 | 704.01 | 575412.880 | 1014 | 1266 | 454.512543 | 567.468323 |
| 2022-04 | 561.09 | 437140.991 | 893 | 1121 | 389.956281 | 489.519587 |
| 2022-05 | 734.43 | 606307.600 | 1047 | 1433 | 423.103699 | 579.090353 |
| 2022-06 | 699.85 | 600571.150 | 1020 | 1322 | 454.289826 | 588.795245 |
| 2022-07 | 625.42 | 574228.191 | 1025 | 1298 | 442.394600 | 560.222625 |
| 2022-08 | 720.15 | 571969.030 | 933 | 1196 | 478.234975 | 613.042905 |
| 2022-09 | 842.42 | 834698.432 | 1219 | 1587 | 525.959945 | 684.740305 |
| 2022-10 | 917.26 | 951023.290 | 1400 | 1845 | 515.459778 | 679.302350 |
| 2022-11 | 1500.41 | 1323610.690 | 1751 | 2605 | 508.103912 | 755.917013 |
| 2022-12 | 1184.07 | 1212558.610 | 1489 | 2309 | 525.144482 | 814.344265 |
- Advertising costs (adv_cost) increase over time: in January 2022 they were 476.54, and by December they had risen to 1184.07. The jump in costs from September onward is particularly noteworthy and may reflect ramped-up advertising campaigns ahead of the holiday season.
- The highest costs were in November 2022 (1500.41). This could be associated with an increase in advertising campaigns in anticipation of the holiday season.
- The lowest costs were in January, February, and April 2022: 476.54, 495.94, and 561.09 respectively. This could reflect fewer advertising campaigns after the New Year holidays and during the spring lull. Notably, despite the low advertising spend in January, ARPU that month (502.07) was above average, pointing to highly effective campaigns in that period.
- Overall, advertising costs vary from month to month, which could be associated with various factors such as seasonality, special advertising campaigns, and changes in advertising strategy.
- Total revenue (total_revenue) also grows over time, which may be linked to the growth in advertising costs. The increase is especially pronounced from September onward, peaking at 1,323,611 in November, coinciding with the rise in advertising spend.
- Average Revenue Per User (ARPU) and Average Revenue Per Paying User (ARPPU) also increase over time, indicating that the quality of attracted users is improving. The lowest ARPU and ARPPU values are observed in February and April, which coincide with the periods of the lowest advertising costs.
- The number of unique users (unique_customers) and the total number of orders (total_orders) also increase over time, indicating successful attraction of new users and increased activity of existing users. It is particularly noteworthy that there was a significant increase in these indicators starting from September, which coincides with the increase in advertising costs.
- The best months for all indicators are November and December. ARPPU was $755 and $814, respectively.
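The month-over-month dynamics described above can be quantified with `pct_change`. A small sketch on the first four months of the `total_revenue` column (figures copied from the monthly_data table above):

```python
import pandas as pd

# Revenue for Jan-Apr 2022, taken from the monthly_data table above
monthly = pd.DataFrame(
    {'total_revenue': [459389.72, 390192.68, 575412.88, 437140.99]},
    index=pd.PeriodIndex(['2022-01', '2022-02', '2022-03', '2022-04'], freq='M'))

# Month-over-month growth in percent; the first month has no predecessor
monthly['mom_growth_pct'] = monthly['total_revenue'].pct_change() * 100
print(monthly.round(1))
```

February shows a drop of roughly 15%, followed by a rebound of about 47% in March.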
df_customers['first_dt'] = pd.to_datetime(df_customers['first_date_invoice']).dt.date
df['invoice_dt'] = pd.to_datetime(df['invoice_date']).dt.date
df_customers.rename(columns={'revenue':'total_rev'}, inplace=True)
paying_customers = df_customers[df_customers['total_rev'] > 0]
orders = df[df['revenue'] > 0]
# Remove organic users, since we don't pay anything to acquire them
orders_wo = orders.query('channel !="organic"')
# Group by date and feature if provided, calculate total revenue and number of unique customers
def calculate_daily_arppu(df, feature=None):
if feature:
daily_data = df.groupby(['invoice_dt', feature]).agg({'revenue': 'sum', 'customer_id': 'nunique', 'order_id': 'count'}).reset_index()
else:
daily_data = df.groupby('invoice_dt').agg({'revenue': 'sum', 'customer_id': 'nunique', 'order_id': 'count'}).reset_index()
# Calculate ARPPU
daily_data['arppu'] = daily_data['revenue'] / daily_data['customer_id']
return daily_data
def plot_daily_data(df, feature=None, window=7, font_color=font_color, csfont=csfont):
sns.set(context='notebook', style='darkgrid', palette='Set2', font='sans-serif', font_scale=1, color_codes=False, rc=None)
plt.rcParams.update({'font.size': 16, 'grid.color': 'w',
'grid.linestyle': '-', 'grid.linewidth': 1.5, 'grid.alpha': 0.9})
color_palette = plt.get_cmap('tab10')
# If a feature is provided, plot data for each value of the feature
if feature:
values = df[feature].unique()
# If more than 5 unique values, select top 5 by total revenue and ARPPU
if len(values) > 5:
top_values_revenue = df.groupby(feature)['revenue'].sum().nlargest(5).index
top_values_arppu = df.groupby(feature)['arppu'].mean().nlargest(5).index
values = set(top_values_revenue).union(set(top_values_arppu))
fig, axs = plt.subplots(2, 1, figsize=(8, 8))
for i, value in enumerate(values):
filtered_data = df[df[feature] == value].copy()  # copy to avoid SettingWithCopyWarning
# Apply rolling mean
filtered_data['revenue'] = filtered_data['revenue'].rolling(window=window).mean()
filtered_data['arppu'] = filtered_data['arppu'].rolling(window=window).mean()
axs[0].plot(filtered_data['invoice_dt'], filtered_data['revenue'], label=value, color=color_palette(i))
axs[1].plot(filtered_data['invoice_dt'], filtered_data['arppu'], label=value, color=color_palette(i))
axs[0].set_title('Revenue over time', fontsize=18, color=font_color, **csfont)
axs[0].set_xlabel('')
axs[0].set_ylabel('')
axs[0].legend(title=feature)
axs[1].set_title('ARPPU over time', fontsize=18, color=font_color, **csfont)
axs[1].set_xlabel('')
axs[1].set_ylabel('')
axs[1].legend(title=feature)
else:
# If no feature is provided, plot total data
fig, axs = plt.subplots(2, 1, figsize=(7, 7))
# Apply rolling mean
df['revenue'] = df['revenue'].rolling(window=window).mean()
df['arppu'] = df['arppu'].rolling(window=window).mean()
axs[0].plot(df['invoice_dt'], df['revenue'])
axs[0].set_title('Revenue over time', fontsize=18, color=font_color, **csfont)
axs[0].set_xlabel('')
axs[0].set_ylabel('')
axs[1].plot(df['invoice_dt'], df['arppu'])
axs[1].set_title('ARPPU over time', fontsize=18, color=font_color, **csfont)
axs[1].set_xlabel('')
axs[1].set_ylabel('')
plt.tight_layout()
plt.show()
daily_arppu = calculate_daily_arppu(df)
plot_daily_data(daily_arppu, window=7)
- In general, Revenue and ARPPU fell after January and then grew steadily over time.
- At the same time, drawdowns are visible in some weeks, but the overall trend for Revenue and ARPPU is upward.
daily_arppu = calculate_daily_arppu(df, 'channel')
plot_daily_data(daily_arppu, 'channel', window=60)
- The dynamics of revenue over time by channel are positive; the organic channel performs best.
- ARPPU is growing over time. The laggard channels are other_banner and g_ads_s.
daily_arppu = calculate_daily_arppu(df, 'country')
plot_daily_data(daily_arppu, 'country', window=60)
- In terms of revenue, growth dynamics are most pronounced in the USA, the UK, and Germany.
- By ARPPU, the distribution over time shows a slight drop in some months. Aggregating the chart weekly makes it easier to track the dynamics for individual countries.
LTV (Customer Lifetime Value) is a marketing metric that measures the total value a customer can bring to a business over the length of their relationship with the company. This can help businesses prioritize their marketing efforts and understand the potential long-term impact of individual customers. LTV is typically calculated by taking a customer's past spending, and extrapolating the expected future spending based on their behavior. It is used by businesses across various industries to gain a competitive edge, make informed decisions, and improve their overall profitability.
ROI (Return On Investment) is the profit or return generated by an investment. It is calculated by subtracting the initial investment cost from the total revenue generated, and dividing the result by the initial investment cost. A positive ROI indicates that the investment was profitable, while a negative ROI indicates that the investment was not profitable.
CAC stands for Customer Acquisition Cost. It is a marketing metric used to calculate the cost of acquiring a new customer and converting them into a paying customer, computed by dividing the total cost of acquiring customers by the number of customers acquired. CAC is an essential metric for businesses to track and manage their customer acquisition costs and make informed decisions about marketing spend.
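As a hedged illustration of how these three metrics fit together, here is a toy calculation; the numbers and variable names below are made up for the example and are not taken from this dataset:

```python
# Toy illustration of CAC, LTV and ROI (all values are hypothetical).
marketing_cost = 50000.0       # total spend on one acquisition channel
acquired_customers = 1000      # customers attracted by that spend
revenue_from_cohort = 180000.0 # revenue those customers generated

cac = marketing_cost / acquired_customers       # cost per acquired customer
ltv = revenue_from_cohort / acquired_customers  # average revenue per customer
roi = ltv / cac                                 # payback multiple; > 1 is profitable

print(f"CAC={cac:.2f}, LTV={ltv:.2f}, ROI={roi:.2f}")
```

With these numbers each customer costs 50 to acquire and brings in 180, so the channel pays back 3.6x, comfortably above the break-even level of 1.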
# last date of observation
observation_date = datetime.datetime(2022, 12, 31).date()
# Analysis horizon
horizon_days = 30
def get_ltv_roi(customers_df, orders_df, observation_date, horizon_days, dimensions=[], ignore_horizon=False):
"""
Calculate LTV and ROI for given customers and orders data.
"""
# Exclude users who did not "live" to the analysis horizon
last_acquisition_date = observation_date
if not ignore_horizon:
last_acquisition_date = observation_date - datetime.timedelta(days=horizon_days - 1)
raw_result = customers_df.query('first_dt <= @last_acquisition_date')
# Add purchase data to profiles
raw_result = raw_result.merge(orders_df[['customer_id', 'order_id', 'invoice_dt', 'revenue', 'channel', 'cost']], on='customer_id', how='left')
# Calculate user lifetime for each purchase
raw_result['lifetime'] = (raw_result['invoice_dt'] - raw_result['first_dt']).dt.days
# Calculate profit
raw_result['profit'] = raw_result['revenue'] - raw_result['cost']
# Group by cohort if there is nothing in dimensions
if len(dimensions) == 0:
raw_result['cohort'] = 'All users'
dimensions = dimensions + ['cohort']
# Function to group by desired features
def group_by_dimensions(df, dims, horizon_days):
# Build "triangular" profit table
result = df.pivot_table(index=dims, columns='lifetime', values='profit', aggfunc='sum')
# Find the sum of profit with accumulation
result = result.fillna(0).cumsum(axis=1)
# Calculate cohort sizes
cohort_sizes = df.groupby(dims).agg({'customer_id': 'nunique'}).rename(columns={'customer_id': 'cohort_size'})
# Combine cohort sizes and profit table
result = cohort_sizes.merge(result, on=dims, how='left').fillna(0)
# Calculate LTV: divide each "cell" in the row by the cohort size
result = result.div(result['cohort_size'], axis=0)
# Exclude all lifetimes exceeding the analysis horizon
result = result[['cohort_size'] + list(range(horizon_days))]
# Restore cohort sizes
result['cohort_size'] = cohort_sizes
# Collect a dataframe with user data and CAC values, adding parameters from dimensions
cac = df[['customer_id', 'cost'] + dims].drop_duplicates()
# Calculate average CAC by parameters from dimensions
cac = cac.groupby(dims).agg({'cost': 'mean'}).rename(columns={'cost': 'cac'})
# Calculate ROI: divide LTV by CAC
roi = result.div(cac['cac'], axis=0)
# Remove rows with infinite ROI
# roi = roi.replace([np.inf, -np.inf], np.nan)
roi = roi[~roi['cohort_size'].isin([np.inf])]
# Restore cohort sizes in the ROI table
roi['cohort_size'] = cohort_sizes
# Add CAC to the ROI table
roi['cac'] = cac['cac']
# In the final table, leave cohort sizes, CAC and ROI in lifetimes not exceeding the analysis horizon
roi = roi[['cohort_size', 'cac'] + list(range(horizon_days))]
# Return the LTV and ROI tables
return result, roi
# Get LTV and ROI tables
grouped_result, grouped_roi = group_by_dimensions(raw_result, dimensions, horizon_days)
# For dynamic tables, remove 'cohort' from dimensions
if 'cohort' in dimensions:
dimensions = []
# Get LTV and ROI dynamics tables
in_time_result, in_time_roi = group_by_dimensions(raw_result, dimensions + ['first_dt'], horizon_days)
return (
raw_result, # raw data
grouped_result, # LTV table
in_time_result, # LTV dynamics table
grouped_roi, # ROI table
in_time_roi, # ROI dynamics table
)
# creating datasets
ltv_raw, ltv_data, ltv_hist, roi_data, roi_hist = get_ltv_roi(df_customers, df, observation_date, horizon_days, ignore_horizon=True)
def apply_rolling_mean(dataframe, window_size):
"""
Apply a rolling mean to each column of the dataframe.
"""
for column_name in dataframe.columns:
dataframe[column_name] = dataframe[column_name].rolling(window_size).mean()
return dataframe
def plot_ltv_roi(ltv_data, ltv_hist, roi_data, roi_hist, horizon, window=7):
"""
Plot LTV and ROI for given customers and orders data.
"""
sns.set(rc={'axes.facecolor':'#DCDCDC'}, palette='tab10') # graph facecolor #e6e6fa
plt.figure(figsize=(10, 8))
# plt.rcParams.update({'font.size': 16})
# csfont = {'fontname':'Georgia', 'fontsize': 20}
# Prepare data for plotting
ltv = ltv_data.drop(columns=['cohort_size'])
ltv_hist = ltv_hist.drop(columns=['cohort_size'])[[horizon - 1]]
# cac_hist = roas_hist[['cac']]
roi = roi_data.drop(columns=['cohort_size', 'cac'])
roi_hist = roi_hist.drop(columns=['cohort_size', 'cac'])[[horizon - 1]]
# Function to create pivot table
def create_pivot(data, value, aggfunc='mean'):
columns = [name for name in data.index.names if name != 'first_dt']
return data.pivot_table(index='first_dt', columns=columns, values=value, aggfunc=aggfunc)
# Function to plot data
def plot_data(ax, data, xlabel, title, legend=True):
data.plot(grid=True, ax=ax)
# ax.grid(True, color='#ffffff', linestyle='-', linewidth=1.5, alpha=1)
if legend: plt.legend()
plt.xlabel(xlabel, **hfont)
plt.title(title, **csfont)
# Plot LTV curves
ax1 = plt.subplot(2, 2, 1)
plot_data(ax1, ltv.T, 'Lifetime', 'LTV')
# Plot LTV dynamics
ax2 = plt.subplot(2, 2, 2, sharey=ax1)
plot_data(ax2, apply_rolling_mean(create_pivot(ltv_hist, horizon - 1), window), 'Acquisition Date', f'LTV Dynamics on {horizon}-th Day')
plt.xticks(rotation=90)
# Plot ROAS curves
ax3 = plt.subplot(2, 2, 3)
plot_data(ax3, roi.T, 'Lifetime', 'ROI')
plt.axhline(y=1, color='red', linestyle='--', label='Payback level')
# Plot ROAS dynamics
ax4 = plt.subplot(2, 2, 4)
plot_data(ax4, apply_rolling_mean(create_pivot(roi_hist, horizon - 1), window), 'Acquisition Date', f'ROI Dynamics on {horizon}-th Day')
plt.axhline(y=1, color='red', linestyle='--', label='Payback level')
plt.xticks(rotation=90)
plt.tight_layout()
def plot_cac(cac_history, window=7):
"""
Plot CAC for given roi_hist[['cac']].
"""
plt.rcParams.update({'font.size': 14})
sns.set(rc={'axes.facecolor':'#DCDCDC'}, palette='tab10') # graph facecolor #e6e6fa
plt.grid(color='w', linestyle='-', linewidth=1.25, alpha=0.5)
from IPython.core.pylabtools import figsize
figsize(12, 4)
plt.figure()
columns = [name for name in cac_history.index.names if name not in ['first_dt']]
filtered_data = cac_history.pivot_table(
index='first_dt', columns=columns, values='cac', aggfunc='mean'
)
apply_rolling_mean(filtered_data, window).plot(grid=True)
plt.xlabel('', **hfont)
plt.ylabel('User Acquisition Cost', **hfont)
plt.title('Dynamics of User Acquisition Cost', **csfont)
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
plot_ltv_roi(ltv_data, ltv_hist, roi_data, roi_hist, horizon_days, window=14)
plot_cac(roi_hist[['cac']], window=14)
Conclusions:
- LTV (Customer Lifetime Value): This indicator shows how much money one customer brings in over the entire period of cooperation with the company. The graphs show that LTV increases slightly over time. This means that customers continue to generate income after the first purchase. The peak is reached in December, and spikes are also visible in other months of the year.
- ROI (Return On Investment): Indicator of advertising effectiveness and return on advertising costs. If the ROI is greater than 1, then advertising is profitable; if it is less than 1, it is unprofitable. The graphs show that the ROI for all advertising channels is many times greater than 1, which indicates their effectiveness.
- CAC (Customer Acquisition Cost): This indicator shows how much it costs to attract one paying customer. The graph shows that CAC varies across time periods.
"""
Use data df (with organic channel) and all customers (&rev = 0)
"""
dimensions=['channel']
ltv_raw, ltv_data, ltv_hist, roi_data, roi_hist = get_ltv_roi(
df_customers, df, observation_date, horizon_days,
dimensions=dimensions, ignore_horizon=True
)
plot_ltv_roi(ltv_data, ltv_hist, roi_data, roi_hist, horizon_days, window=14)
plot_cac(roi_hist[['cac']], window=14)
Conclusions:
- The graphs show that LTV is growing over time for all advertising channels. This means that customers continue to generate income after the first purchase.
- The graphs show that the ROI for all advertising channels is many times greater than 1, which indicates their effectiveness. The g_ads_s channel has the lowest ROI, suggesting the lowest efficiency compared to the others.
- The graph shows that CAC varies depending on the advertising channel. The g_ads_s channel has the highest CAC, which, combined with its low ROI, indicates its lowest efficiency.
- The best results are shown by g_ads_b (Google Ads Media): its ROI is noticeably higher, its CAC is the lowest, and its LTV is the highest. However, there are noticeable gaps in ad impressions. What could explain them?
- You can notice a peak in indicators for fb_ads in the August-September period. You can also notice the stability of this attraction channel.
In general, based on these data, we can conclude that the advertising channels fb_ads, g_ads_b and other_banner are the most effective, while the g_ads_s channel requires additional analysis and possible optimization.
"""
Use data df (with organic channel)
and df_customers_filtered (&rev = 0) by top 10 country total revenue
"""
country_top = paying_customers.groupby('country').agg({'customer_id': 'count', 'order_count': 'sum', 'total_rev': 'sum'})
country_top.rename(columns={'customer_id': 'p_customers', 'order_count': 'num_orders', 'total_rev': 'total_revenue'}, inplace=True)
country_top.sort_values('total_revenue', ascending=False, inplace=True)
top10_country_revenue = country_top.reset_index()
top10_country_revenue = top10_country_revenue['country'].head(10).to_list()
"""
OR
"""
# top_values_revenue = daily_arppu.groupby('country')['revenue'].sum().nlargest(10).index
# top_values_arppu = daily_arppu.groupby('country')['arppu'].mean().nlargest(5).index
# top_country = set(top_values_revenue).union(set(top_values_arppu))
# top_country = list(top_country)
df_customers_filtered = df_customers[df_customers['country'].isin(top10_country_revenue)] #top_country
dimensions = ['country']
ltv_raw, ltv_data, ltv_hist, roi_data, roi_hist = get_ltv_roi(
df_customers_filtered, df, observation_date, horizon_days,
dimensions=dimensions, ignore_horizon=True
)
plot_ltv_roi(ltv_data, ltv_hist, roi_data, roi_hist, horizon_days, window=7)
plot_cac(roi_hist[['cac']], window=7)
- The graphs show that LTV for the TOP10 countries by total revenue falls in January-February and increases in November-December. This means that there is an obvious decline in purchases after the new year and an increase before it.
- The graphs show that the ROI for the TOP10 countries in terms of total revenue is many times higher than 1, which indicates their effectiveness.
- The highest LTV and ROI indicators are observed for Lithuania, Australia, and the Netherlands, which indicates the high efficiency of attracting customers in those countries.
Customer retention is a key metric that reflects a company's ability to retain customers over a certain period of time. It's an important aspect of business, as attracting new customers is typically more costly than retaining existing ones.
Customer retention can be calculated using the following formula:

Retention Rate = (Customers at the End of the Period / Customers at the Start of the Period) × 100%

where:
- Customers at the Start of the Period is the number of customers you had at the beginning of the period;
- Customers at the End of the Period is the number of those customers who are still active at its end.

For example, if you had 100 customers at the beginning of the month and 90 of them were still customers at the end of the month, your retention rate would be 90%.
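The worked example above can be sketched in a few lines; the customer identifiers here are synthetic, used only to make the set arithmetic explicit:

```python
# Minimal retention-rate sketch matching the 100 -> 90 example above.
start = set(range(100))   # customers active at the start of the month
end = set(range(90))      # the subset of them still active at the end

# Retention counts only customers present at the start, hence the intersection.
retention_rate = len(start & end) / len(start) * 100
print(f"Retention rate: {retention_rate:.0f}%")  # 90%
```

Intersecting with the starting set matters in practice: customers acquired mid-month should not inflate the retention of the original cohort.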
Customer retention is important for several reasons. Firstly, it helps increase profitability, as retaining existing customers costs less than acquiring new ones. Secondly, customer retention can contribute to business growth, as satisfied customers often become brand ambassadors and attract new customers. Finally, a high level of customer retention can serve as an indicator of the quality of a company's products or services.
However, it's important to remember that customer retention is not the only metric to consider. For a complete picture, it's also important to consider other metrics such as profit per customer, customer acquisition cost, and customer satisfaction level.
LTV (Lifetime Value) is a metric that shows the average amount of money a single customer brings over the entire period of their activity. It's an important metric that helps understand how much you can afford to spend on acquiring new customers and retaining existing ones.
The calculation of LTV using the cohort method involves the following steps: group customers into cohorts by the month of their first purchase, track each cohort's revenue and costs over time, and divide the accumulated profit by the cohort size.
The simple formula for LTV calculation is:

LTV = (Total Revenue − Total Cost) / Number of Customers in the Cohort

LTV is calculated as the difference between total revenue and total cost, divided by the number of customers in the cohort. This gives the average LTV value for each customer in the cohort.
So, the LTV calculated using the cohort method shows the average amount of profit each customer from a certain cohort brings over the entire period of their activity. This helps understand how customer profitability changes over time and how to effectively manage costs for customer acquisition and retention.
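The steps above can be sketched on a tiny synthetic cohort. The column names are illustrative and do not necessarily match the notebook's exact schema:

```python
import pandas as pd

# Toy orders for a single January cohort of three customers.
orders = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 3, 3],
    'cohort': ['2022-01'] * 6,
    'revenue': [100.0, 50.0, 80.0, 40.0, 60.0, 20.0],
    'cost': [10.0, 5.0, 8.0, 4.0, 6.0, 2.0],
})

# LTV = (cohort revenue - cohort cost) / cohort size
cohort = orders.groupby('cohort').agg(
    revenue=('revenue', 'sum'),
    cost=('cost', 'sum'),
    n_customers=('customer_id', 'nunique'),
)
cohort['ltv'] = (cohort['revenue'] - cohort['cost']) / cohort['n_customers']
print(cohort)
```

Here the cohort earned 350 in revenue against 35 in costs across 3 customers, so the average customer in the cohort is worth (350 − 35) / 3 = 105.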
Explanations of the graphs:
Monthly Cohorts: User Retention: This graph shows user retention divided into monthly cohorts over the monthly periods since the first order was placed. The higher the percentage, the better the retention.
Monthly Cohorts: User LTV: This graph shows the User Lifetime Value (LTV) divided into monthly cohorts over the monthly periods since the order was placed. The higher the value, the greater the profitability from the user.
Average retention and Lifetime: This graph shows the average retention and customer lifetime between different categories
def analyze_cohort(dataframe, feature=None, ht=6):
if feature:
df = dataframe[["customer_id", "order_id", "month", "invoice_date", "revenue", "cost", feature]].drop_duplicates()
df["cohort"] = df.groupby("customer_id")["invoice_date"].transform("min").dt.to_period("M")
df["ltv"] = df["revenue"] - df["cost"]
cohort_data = df.groupby(["cohort", "month", feature]).agg(n_customers=("customer_id", "nunique"), ltv=("ltv", "sum")).reset_index(drop=False)
cohort_data["period_number"] = (cohort_data.month - cohort_data.cohort).apply(attrgetter("n"))
pivot_cohort = cohort_data.pivot_table(index=["cohort", feature], columns="period_number", values=["n_customers", "ltv"])
else:
df = dataframe[["customer_id", "order_id", "month", "invoice_date", "revenue", "cost"]].drop_duplicates()
df["cohort"] = df.groupby("customer_id")["invoice_date"].transform("min").dt.to_period("M")
df["ltv"] = df["revenue"] - df["cost"]
cohort_data = df.groupby(["cohort", "month"]).agg(n_customers=("customer_id", "nunique"), ltv=("ltv", "sum")).reset_index(drop=False)
cohort_data["period_number"] = (cohort_data.month - cohort_data.cohort).apply(attrgetter("n"))
pivot_cohort = cohort_data.pivot_table(index="cohort", columns="period_number", values=["n_customers", "ltv"])
cohort_size = pivot_cohort["n_customers"].iloc[:, 0]
retention_matrix = pivot_cohort["n_customers"].divide(cohort_size, axis=0)
ltv_matrix = pivot_cohort["ltv"].divide(cohort_size, axis=0)
with sns.axes_style("white"):
fig, ax = plt.subplots(2, 2, figsize=(10, ht*2), sharey=True, gridspec_kw={"width_ratios": [1, 11]})
sns.heatmap(retention_matrix, mask=retention_matrix.isnull(), annot=True, cbar=False, fmt=".0%", center=1, ax=ax[0, 1])
ax[0, 1].set_title("Monthly Cohorts: User Retention", fontsize=18, **csfont)
ax[0, 1].set(xlabel="# of periods", ylabel="")
sns.heatmap(ltv_matrix, mask=ltv_matrix.isnull(), annot=True, cbar=False, fmt=".1f", center=1, cmap='BrBG', ax=ax[1, 1])
ax[1, 1].set_title("Monthly Cohorts: LTV", fontsize=18, **csfont)
ax[1, 1].set(xlabel="# of periods", ylabel="")
white_cmap = mcolors.ListedColormap(["white"])
sns.heatmap(pd.DataFrame(cohort_size).rename(columns={0: "cohort_size"}), annot=True, cbar=False, fmt="g", cmap=white_cmap, ax=ax[0, 0])
sns.heatmap(pd.DataFrame(cohort_size).rename(columns={0: "cohort_size"}), annot=True, cbar=False, fmt="g", cmap=white_cmap, ax=ax[1, 0])
fig.tight_layout()
return retention_matrix, ltv_matrix
retention_matrix, ltv_matrix = analyze_cohort(df, feature=None)
Visually, the retention dynamics for all data shows general trends:
- Seasonality. We see repeating patterns in retention dynamics, which may indicate seasonality in user behavior. Retention is lowest in April and gradually increases, reaching highs at the end of the year.
- Significant changes in retention are noticeable for customers who came in January 2022: they have a high retention rate.
- The highest retention is observed for January customers in November and December: 41% and 43%. Why is that? This could be due to changes in the product, marketing campaigns, or the external environment for January customers.
From the ltv matrix data, we can draw the following conclusions:
- The maximum Lifetime Value (LTV) is observed for customers acquired in December 2022, amounting to 747.41. This indicates a high profitability of these customers for the company.
- The minimum LTV is observed for customers acquired in June 2022, in the second period (July), and is 39.66. This may indicate that these customers bring less revenue to the company.
- The average LTV of customers acquired in January 2022 steadily increases over time, reaching a peak in December 2022 with a value of 522.05. This indicates that these customers continue to bring revenue to the company throughout the year.
- For customers acquired in subsequent months, a similar trend is observed, although the overall LTV is slightly lower. For example, customers acquired in February 2022 reach a peak LTV in the tenth period with a value of 302.01.
retention_matrix, ltv_matrix = analyze_cohort(df, feature='channel', ht=16)
Retention in the context of customer acquisition channels by monthly cohorts has the following trends:
- For customers of the January cohort, periods 10 (November) and 11 (December) have the maximum retention values. The leaders are other_banner (73% and 68%), organic (52% and 72%), and g_ads_s (62% and 61%). What explains customer returns at the end of the year?
- In periods 2 and 3, by contrast, low retention is observed for organic and other_banner, dropping to 7% in places. What is causing this instability?
From the ltv matrix data by channel, we can draw the following conclusions:
- The maximum Lifetime Value (LTV) is observed for customers acquired through g_ads_s in December 2022, amounting to 1322.35. This indicates a high profitability of these customers for the company.
- The minimum LTV is observed for customers acquired through g_ads_s in November 2022, in the second period, and is 84.67. This may indicate that these customers bring less revenue to the company.
- The average LTV of customers acquired through fb_ads in January 2022 steadily increases over time, reaching a peak in the 11th period with a value of 348.81. This indicates that these customers continue to bring revenue to the company throughout the year.
- For customers acquired in subsequent months, a similar trend is observed, although the overall LTV is slightly lower. For example, customers acquired through fb_ads in February 2022 reach a peak LTV in the seventh period with a value of 456.63.
top_values_revenue = daily_arppu.groupby('country')['revenue'].sum().nlargest(5).index
top_values_arppu = daily_arppu.groupby('country')['arppu'].mean().nlargest(5).index
list_top_country = set(top_values_revenue).union(set(top_values_arppu))
list_top_country = list(list_top_country)
top_country = df.query('country== @list_top_country')
retention_matrix, ltv_matrix = analyze_cohort(top_country, feature='country', ht=26)
# print(ltv_matrix.to_string())
Retention in the context of country by monthly cohorts has the following trends:
- Customer retention for countries with large cohort sizes is relatively stable. Only the January cohort stands out.
- For small cohorts, retention values jump around. This is natural: in one month none of the clients may return, while in another month all of them may.
From LTV, we can draw the following conclusions:
- The maximum Lifetime Value (LTV) is observed for customers from the Netherlands acquired in January 2022, amounting to 5117.45. This indicates a high profitability of these customers for the company.
- The minimum LTV is observed for customers from Canada acquired in August 2022, amounting to 16.28. This may indicate that these customers bring less revenue to the company.
- The average LTV of customers from Australia acquired in January 2022 steadily increases over time, peaking in the 9th period with a value of 1449.96. This indicates that these customers continue to bring revenue to the company throughout the year.
- For customers acquired in subsequent months, a similar trend is observed, although the overall LTV is slightly lower. For example, customers from Australia acquired in February 2022 reach a peak LTV in the 9th period with a value of 914.07.
Maximum peak values LTV for each country:
- Australia: is observed in December 2022 with a value of 3828.93.
- Canada: is observed in October 2022 with a value of 2030.30.
- EIRE: is observed in January 2022 with a value of 1992.10.
- Germany: is observed in February 2022 with a value of 1395.79.
- Netherlands: is observed in January 2022 with a value of 5117.44.
- Singapore: is observed in March 2022 with a value of 4566.99.
- Switzerland: is observed in April 2022 with a value of 1541.19.
- USA: is observed in June 2022 with a value of 562.98.
- United Kingdom: is observed in December 2022 with a value of 789.70.
def data_cohort_retention_lifetime(dataframe, feature):
"""
This function calculates the retention matrix, average retention,
and lifetime for a given feature.
"""
df = dataframe[["customer_id", "order_id", "month", "invoice_date", feature]].drop_duplicates()
df["cohort"] = df.groupby("customer_id")["invoice_date"].transform("min").dt.to_period("M")
cohort_data = df.groupby(["cohort", "month", feature]).agg(n_customers=("customer_id", "nunique")).reset_index(drop=False)
cohort_data["period_number"] = (cohort_data.month - cohort_data.cohort).apply(attrgetter("n"))
# Pivot the data to get a cohort-period matrix
pivot_cohort = cohort_data.pivot_table(index=["cohort", feature], columns="period_number",
values="n_customers", fill_value=0) #fill_value=0
# Calculate the retention matrix excluding the first period
retention_matrix_excl_first = pivot_cohort.drop(columns=0)
# Calculate the average retention for all periods by feature
avg_retention = retention_matrix_excl_first.mean(axis=1).groupby(level=1).mean()
# Calculate the lifetime by feature
lifetime = df.groupby(['customer_id', feature])['invoice_date'].max() - df.groupby(['customer_id', feature])['invoice_date'].min()
lifetime = lifetime.dt.days.groupby(level=1).mean()
# Calculate the size of each cohort
cohort_size = pivot_cohort.iloc[:, 0]
# Calculate the retention matrix by feature
retention_matrix = pivot_cohort.divide(cohort_size, axis=0)
return retention_matrix, avg_retention, lifetime
def plot_barh_retention_lifetime(retention, lifetime, feature, ht=5,
color1=c1, color2=c2, font_color=font_color, csfont=csfont, hfont=hfont):
# Select data
data1 = retention.reset_index()
data2 = lifetime.reset_index()
data1.sort_values(by=list(data1.columns)[1], axis=0, inplace=True)
data2.sort_values(by=list(data2.columns)[1], axis=0, inplace=True)
# Create figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, ht))
bar_width = 0.6
# Plot the retention graph
barh1 = ax1.barh(data1[feature], data1.iloc[:,1], color=color1, alpha=1, height=bar_width)
min_value = data1.iloc[:,1].min()
max_value = data1.iloc[:,1].max()
for i in ax1.patches:
ax1.text(i.get_width()+i.get_width()*0.01, i.get_y() + bar_width / 2,
f'{round(i.get_width(), 1)} %',
fontsize=14, fontweight='bold',
color=font_color, va='center')
i.set_color(color2 if i.get_width() == max_value else '#e05b4b' if i.get_width() == min_value else color1)
ax1.set(xlabel='', ylabel='', frame_on=False)
ax1.tick_params(left=False, bottom=False, labelbottom=False)
ax1.set_title(f'Average Retention by {feature.capitalize()}', fontsize=20, color=font_color, **csfont)
# Plot the lifetime graph
barh2 = ax2.barh(data2[feature], data2.iloc[:,1], color=color1, alpha=1, height=bar_width)
min_lifetime = data2.iloc[:,1].min()
max_lifetime = data2.iloc[:,1].max()
for i in ax2.patches:
ax2.text(i.get_width()+i.get_width()*0.01, i.get_y() + bar_width / 2,
str(round(i.get_width(), 1)),
fontsize=14, fontweight='bold',
color=font_color, va='center')
i.set_color(color2 if i.get_width() == max_lifetime else c3 if i.get_width() == min_lifetime else color1)
ax2.set(xlabel='', ylabel='', frame_on=False)
ax2.tick_params(left=False, bottom=False, labelbottom=False)
ax2.set_title(f'Lifetime total by {feature.capitalize()}', fontsize=20, color=font_color, **csfont)
plt.tight_layout()
plt.show()
retention_matrix, avg_retention, lifetime = data_cohort_retention_lifetime(df, 'channel')
plot_barh_retention_lifetime(avg_retention, lifetime, 'channel')
The average customer retention by acquisition channels showed:
- Organic customers have the highest retention (22.4%), which indicates the stability of the acquisition channel.
- Channel g_ads_s has the lowest average retention (11.5%). But at the same time, in November-December this channel was one of the leaders.
- Customer lifetime follows the retention pattern, ranging from 35 to 56 days.
retention_matrix, avg_retention, lifetime = data_cohort_retention_lifetime(df, 'country')
plot_barh_retention_lifetime(avg_retention, lifetime, 'country', ht=12)
- Average customer retention by country objectively reflects reality only for countries with a sufficient number of customers, because with a small number of clients retention will be zero in some periods.
- The longest customer lifetime is in the Czech Republic (133 days), and the shortest is in Malta (61 days).
In a business context, a clustering algorithm is a method that helps to segment customers: the process of grouping similar customers into the same segment. Clustering helps to better understand customers in terms of both static demographics and dynamic behavior. Customers with comparable characteristics often interact with the business in the same way, so the business can benefit from this methodology by creating an individual marketing strategy for each segment. Based on this, customers can be offered discounts, special offers, promo codes, and so on.
As a simple example, suppose a company wants to offer its customers discounts. Currently, its staff study the details of each customer and, based on this information, decide what offer to make. The company could potentially have thousands of customers. Does it make sense to look at the details of each customer separately and then make a decision? Of course not: this is a manual process and would take a huge amount of time. So what can the company's marketing department do? One option is to divide customers into different groups. For example, the company can group customers by RFM analysis.
The abbreviation RFM stands for:
Recency - how recently your customers made a purchase. A high recency score means the customer has visited you recently and has already formed a good impression of your brand. Recency can be seen in the customer base by sorting customers by the date of their last purchase.
Frequency - how often they buy from you. A high frequency score indicates that the customer likes your brand, products, and services, and therefore returns often. To calculate the frequency of visits, divide the total number of purchases/visits by the number of months/days/years, etc.
Monetary - total spending. A high value of this indicator means that the customer likes to spend money with you.
last_date = df['invoice_dt'].max() + datetime.timedelta(days=1)
rfm = df.groupby('customer_id').agg({'invoice_dt': lambda date: (last_date - date.max()).days,
                                     'order_id': 'count',
                                     'revenue': 'sum'})\
        .rename(columns={'invoice_dt': 'Recency', 'order_id': 'Frequency', 'revenue': 'Monetary'})\
        .sort_values('Monetary', ascending=False)
rfm = rfm[rfm['Monetary'] > 0]
rfm.head()
| customer_id | Recency | Frequency | Monetary |
|---|---|---|---|
| 14646 | 3 | 74 | 280206.02 |
| 18102 | 15 | 60 | 259657.30 |
| 17450 | 17 | 40 | 144142.81 |
| 14156 | 21 | 46 | 78579.26 |
| 14911 | 1 | 124 | 70653.46 |
corr=rfm.corr()
fig = plt.figure(figsize=(4,3))
r = sns.heatmap(corr, annot=True)
r.set_title("Correlation ")
plt.show()
def rfm_plot(rfm, feature, maxlimit=None, bins=40):
plt.figure(figsize=(11, 4), dpi=100)
plt.subplots_adjust(wspace=0.3, hspace=1, left=0, right=1.1)
plt.suptitle(feature, fontsize=20, **csfont)
plt.subplot(121)
sns.boxplot(data=rfm, x=feature, orient='h', fliersize=0.4, color=c1)
plt.xlim(0, maxlimit)
plt.subplot(122)
sns.histplot(rfm[feature], kde=True, bins=bins, color=c2)
plt.xlim(0, maxlimit)
plt.show()
rfm_plot(rfm, 'Recency')
rfm_plot(rfm, 'Frequency', maxlimit=25, bins=100)
rfm_plot(rfm, 'Monetary', maxlimit=8000, bins=500)
rfm_plot(rfm, 'Monetary', maxlimit=2000, bins=2000)
quintiles = rfm[['Recency', 'Frequency', 'Monetary']].quantile([.2, .4, .6, .8]).to_dict()
quintiles
{'Recency': {0.2: 18.0, 0.4: 38.0, 0.6: 79.0, 0.8: 180.0},
'Frequency': {0.2: 1.0, 0.4: 2.0, 0.6: 2.0, 0.8: 4.0},
'Monetary': {0.2: 197.044,
0.4: 422.62800000000004,
0.6: 848.8559999999999,
0.8: 1890.7040000000006}}
print(f"Total customers >>>> {rfm.shape[0]}")
print(f"The number of customers whose purchase frequency = 1 or = 2 is >>>> {rfm[(rfm['Frequency'] >= 1) & (rfm['Frequency'] <= 2)]['Frequency'].count()}")
Total customers >>>> 5117
The number of customers whose purchase frequency = 1 or = 2 is >>>> 3158
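Since 3158 of the 5117 customers have a Frequency of 1 or 2, the quintile edges for Frequency coincide, which is why `get_scores` below bins Frequency with fixed-edge `pd.cut` instead of `pd.qcut`. A minimal sketch of the problem on toy data:

```python
import pandas as pd

# Heavily tied data, like our Frequency column: most values are 1 or 2.
freq = pd.Series([1, 1, 1, 2, 2, 2, 2, 3, 5, 10])

# qcut into 5 bins fails here because several quantile edges coincide.
try:
    pd.qcut(freq, 5, labels=[1, 2, 3, 4, 5])
    qcut_failed = False
except ValueError:
    qcut_failed = True

# Fixed bin edges sidestep the tied quantiles.
f_score = pd.cut(freq, bins=[0, 2, 4, 5, 10, freq.max() + 1],
                 labels=[1, 2, 3, 4, 5])
```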
def get_scores(rfm_df) -> pd.DataFrame:
    rfm_df['R'] = pd.qcut(rfm_df['Recency'], 5, labels=[5, 4, 3, 2, 1])
    # Frequency is heavily tied (most customers bought once or twice), so
    # quintile edges are not unique; use fixed-edge cut instead of qcut.
    max_frequency = rfm_df['Frequency'].max()
    rfm_df['F'] = pd.cut(
        rfm_df['Frequency'], bins=[0, 2, 4, 5, 10, max_frequency], labels=[1, 2, 3, 4, 5]
    )
    rfm_df['M'] = pd.qcut(rfm_df['Monetary'], 5, labels=[1, 2, 3, 4, 5])
    rfm_df['RFM_SCORE'] = rfm_df['R'].astype(str) + rfm_df['F'].astype(str) + rfm_df['M'].astype(str)
    return rfm_df
rfm = get_scores(rfm)
segments = {
r'[1-2]1[1-4]': 'hibernating',
r'[1-2][2-5][1-4]': 'at_risk',
r'3[1-5][1-4]': 'need_attention',
r'[1-3][1-5]5': 'can\'t_loose',
r'[4-5]1[1-3]': 'new_customers',
r'[4-5]1[4-5]': 'rich_new_customers',
r'[4-5][2-3][1-4]': 'potential_loyalists',
r'[4-5][2-3]5': 'rich_potential_loyalists',
r'[4-5][4-5][1-4]': 'loyal_customers',
r'[4-5][4-5]5': 'rich_loyal_customers'
}
rfm['Segment'] = rfm['RFM_SCORE'].replace(segments, regex = True)
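To sanity-check the mapping, each three-digit score can be traced through the patterns by hand; here is a small illustrative re-implementation with `re.fullmatch`, trying the patterns in dictionary order:

```python
import re

# Same segment map as above; every 3-digit RFM score matches one pattern.
segment_patterns = {
    r'[1-2]1[1-4]': 'hibernating',
    r'[1-2][2-5][1-4]': 'at_risk',
    r'3[1-5][1-4]': 'need_attention',
    r'[1-3][1-5]5': 'can\'t_loose',
    r'[4-5]1[1-3]': 'new_customers',
    r'[4-5]1[4-5]': 'rich_new_customers',
    r'[4-5][2-3][1-4]': 'potential_loyalists',
    r'[4-5][2-3]5': 'rich_potential_loyalists',
    r'[4-5][4-5][1-4]': 'loyal_customers',
    r'[4-5][4-5]5': 'rich_loyal_customers',
}

def label(score: str) -> str:
    """Return the first segment whose pattern matches the full score."""
    for pattern, name in segment_patterns.items():
        if re.fullmatch(pattern, score):
            return name
    return 'unmatched'

print(label('212'))  # hibernating
print(label('425'))  # rich_potential_loyalists
```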
rfm.sample(5)
| customer_id | Recency | Frequency | Monetary | R | F | M | RFM_SCORE | Segment |
|---|---|---|---|---|---|---|---|---|
| 17948 | 162 | 1 | 358.56 | 2 | 1 | 2 | 212 | hibernating |
| 15363 | 22 | 1 | 326.00 | 4 | 1 | 2 | 412 | new_customers |
| 12656 | 32 | 3 | 2865.22 | 4 | 2 | 5 | 425 | rich_potential_loyalists |
| 13066 | 47 | 1 | 151.81 | 3 | 1 | 1 | 311 | need_attention |
| 15059 | 55 | 8 | 1173.81 | 3 | 4 | 4 | 344 | need_attention |
segments_counts = rfm['Segment'].value_counts().sort_values(ascending=True)
segments_counts
loyal_customers              149
rich_new_customers           153
rich_potential_loyalists     158
at_risk                      247
can't_loose                  352
rich_loyal_customers         433
potential_loyalists          522
new_customers                669
need_attention               850
hibernating                 1584
Name: Segment, dtype: int64
def plot_segments(segments_counts, width=9, height=6):
fig, ax = plt.subplots()
fig.set_figwidth(width)
fig.set_figheight(height)
plt.title('Number of Customers by Segments', size = 16)
bars = ax.barh(range(len(segments_counts)),
segments_counts,
color=c1)
ax.set_frame_on(False)
ax.tick_params(left=False,
bottom=False,
labelbottom=False)
ax.set_yticks(range(len(segments_counts)))
ax.set_yticklabels(segments_counts.index)
ax.yaxis.set_tick_params(rotation=0, labelsize=14)
min_value = segments_counts.min()
max_value = segments_counts.max()
for i, bar in enumerate(bars):
value = bar.get_width()
if value == max_value:
bar.set_color(c2)
elif value == min_value:
bar.set_color(c3)
ax.text(value,
bar.get_y() + bar.get_height()/2,
'{:,} ({:}%)'.format(int(value),
int(value*100/segments_counts.sum())),
va='center', ha='left', fontsize=14, color=font_color, **hfont
)
plt.show()
plot_segments(segments_counts)
rfm_=rfm.copy()
rfm_['score'] = rfm['RFM_SCORE'].astype(int)
corr_segment = rfm_.corr()
fig = plt.figure(figsize=(4,3))
r = sns.heatmap(corr_segment, annot=True)
r.set_title("Correlation", fontsize=18, **csfont)
plt.show()
def plot_segment_distribution(rfm, segment='Segment', min_limit=0, max_limit=1):
x_percentiles = [min_limit, max_limit]
y_percentiles = [min_limit, max_limit]
plt.figure(figsize=(12, 6), dpi=150)
plt.subplots_adjust(wspace=0.3, hspace=1, left=0, right=1.1)
plt.suptitle(f"{segment} distribution", fontsize=18, **csfont, color=font_color)
plot_data = [
('Recency', 'Frequency'),
('Frequency', 'Monetary'),
('Recency', 'Monetary')
]
for i, (x, y) in enumerate(plot_data, start=1):
plt.subplot(1, 3, i)
sns.scatterplot(x=rfm[x], y=rfm[y], hue=rfm[segment], palette='tab10')
x_low, x_high = rfm[x].quantile(x_percentiles)
y_low, y_high = rfm[y].quantile(y_percentiles)
plt.xlim(x_low, x_high)
plt.ylim(y_low, y_high)
plt.show()
plot_segment_distribution(rfm)
plot_segment_distribution(rfm, min_limit=0.01, max_limit=0.99)
Conclusions on logical segmentation:
- We tried to take all three features into account, but Recency has the greatest influence (inverse correlation, -0.88), followed by Frequency (0.37) and Monetary (0.17).
- We segmented customers into 10 groups. The largest is "hibernating" (30%), and the smallest is plain "loyal_customers" (2%), since "rich_loyal_customers" were split off into a separate segment (8%).
K-Means Segmentation is a type of unsupervised machine learning algorithm used to divide data into distinct groups or clusters based on similarity in features. It's commonly used in customer segmentation to group customers with similar behaviors or characteristics.
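As a minimal illustration of the idea on synthetic data (not our RFM table):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two obvious groups of "customers" in a toy 2-feature space.
group_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[10, 10], scale=0.5, size=(50, 2))
toy = np.vstack([group_a, group_b])

# Scale first: K-Means uses Euclidean distance, so feature scales matter.
scaled = StandardScaler().fit_transform(toy)
toy_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
```

Each synthetic group lands in its own cluster; the sections below apply the same fit/predict pattern to the scaled RFM features.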
rfm_=rfm.reset_index()
X = rfm_[['Recency', 'Frequency', 'Monetary']].copy(deep=True)
Using the elbow method, we will try to find a good split into an acceptable number of clusters. We aim for 10 clusters to match the manual segmentation.
def plot_elbow_method(data_x, start=2, end=10, random_state=77):
scaler = StandardScaler()
scaled_X = scaler.fit_transform(data_x)
ssd = []
range_values = range(start, end)
for k in range_values:
model = KMeans(n_clusters=k, random_state=random_state)
model.fit(scaled_X)
ssd.append(model.inertia_) # Sum of squared distances from points to cluster centers
plt.plot(range_values, ssd, 'o--')
plt.title('Determining the number of clusters using the Elbow Method',
pad=20, fontsize=18, **csfont, color=font_color)
# Annotate each point with cluster number and SSD difference
ssd_diff = pd.Series(ssd).diff()
for i, diff in enumerate(ssd_diff):
plt.annotate(f'K {range_values[i]}\ndiff {diff:.2f}', (range_values[i], ssd[i]))
plt.show()
# return pd.DataFrame(data=zip(range_values, ssd_diff), columns=['Kluster', 'SSD_diff'])
# Depending on random_state, the elbow can be more or less pronounced
plot_elbow_method(X, start=5, end=15, random_state=69)
The Silhouette Score is a measure used to choose the number of clusters. For each sample it compares the mean distance to the other points in its own cluster (a) with the mean distance to points in the nearest neighboring cluster (b): s = (b - a) / max(a, b). The score ranges from -1 to 1, where a high value indicates that the sample is well matched to its own cluster and poorly matched to neighboring clusters. If most objects score high, the clustering configuration is appropriate; if many points score low or negative, the configuration may have too many or too few clusters. The optimal number of clusters is typically the one with the highest average Silhouette Score.
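A minimal illustration on synthetic blobs, where the true number of clusters wins:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Three well-separated blobs of 40 points each.
blobs = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2))
                   for c in ([0, 0], [5, 5], [10, 0])])

scores = {}
for k in (2, 3, 6):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(blobs)
    scores[k] = silhouette_score(blobs, labels_k)

# k=3 (the true number of blobs) gets the highest average silhouette.
best_k = max(scores, key=scores.get)
```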
def plot_silhouette_method(data_x, start=2, end=10, random_state=77):
scaler = StandardScaler()
scaled_X = scaler.fit_transform(data_x)
sils = []
range_values = range(start, end)
for k in range_values:
model = KMeans(n_clusters=k, random_state=random_state)
model.fit(scaled_X)
sils.append(silhouette_score(scaled_X, model.fit_predict(scaled_X)))
plt.plot(range_values, sils, 'o--')
plt.title('Determining the number of clusters using the Silhouette Score', pad=20,
fontsize=18, **csfont, color=font_color)
# Annotate each point with cluster number and Silhouette Score
sil_score = pd.Series(sils)
for i, score in enumerate(sil_score):
plt.annotate(f'K {range_values[i]}\nSSc {score:.3f}', (range_values[i], sil_score[i]))
plt.show()
# return pd.DataFrame(data=zip(range_values, sil_score), columns=['Kluster', 'Silhouette_score (-1>0<1)'])
plot_silhouette_method(X, start=2, end=15, random_state=69)
range_n_clusters = [5, 10]
feature_1 = 0 #(0 is Recency, 1 is Frequency)
feature_2 = 2 #(2 is Monetary)
data_X = X.copy() #X.iloc[:, :-1]
scaler = StandardScaler()
scaled_X_ = scaler.fit_transform(data_X)
for n_clusters in range_n_clusters:
# Create a subplot with 1 row and 2 columns
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(18, 7)
# The 1st subplot is the silhouette plot
# The silhouette coefficient can range from -1, 1 but in this example all
# lie within [-0.1, 1]
ax1.set_xlim([-0.1, 1])
# The (n_clusters+1)*10 is for inserting blank space between silhouette
# plots of individual clusters, to demarcate them clearly.
ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])
# Initialize the clusterer with n_clusters value and a random generator
# seed of 10 for reproducibility.
clusterer = KMeans(n_clusters=n_clusters, n_init="auto", random_state=69)
cluster_labels = clusterer.fit_predict(scaled_X_)
# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters
silhouette_avg = silhouette_score(scaled_X_, cluster_labels)
print(
"For n_clusters =",
n_clusters,
"The average silhouette_score is :",
silhouette_avg,
)
# Compute the silhouette scores for each sample
sample_silhouette_values = silhouette_samples(scaled_X_, cluster_labels)
y_lower = 10
for i in range(n_clusters):
# Aggregate the silhouette scores for samples belonging to
# cluster i, and sort them
ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
ith_cluster_silhouette_values.sort()
size_cluster_i = ith_cluster_silhouette_values.shape[0]
y_upper = y_lower + size_cluster_i
color = cm.nipy_spectral(float(i) / n_clusters)
ax1.fill_betweenx(
np.arange(y_lower, y_upper),
0,
ith_cluster_silhouette_values,
facecolor=color,
edgecolor=color,
alpha=0.7,
)
# Label the silhouette plots with their cluster numbers at the middle
ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
# Compute the new y_lower for next plot
y_lower = y_upper + 10 # 10 for the 0 samples
ax1.set_title("The silhouette plot for the various clusters.")
ax1.set_xlabel("The silhouette coefficient values")
ax1.set_ylabel("Cluster label")
# The vertical line for average silhouette score of all the values
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
ax1.set_yticks([]) # Clear the yaxis labels / ticks
ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
# 2nd Plot showing the actual clusters formed
colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
ax2.scatter(
scaled_X_[:, feature_1], scaled_X_[:, feature_2], marker=".", s=30, lw=0, alpha=0.7, c=colors, edgecolor="k"
)
# Labeling the clusters
centers = clusterer.cluster_centers_
# Draw white circles at cluster centers
ax2.scatter(
centers[:, 0],
centers[:, 1],
marker="o",
c="white",
alpha=1,
s=200,
edgecolor="k",
)
for i, c in enumerate(centers):
ax2.scatter(c[0], c[1], marker="$%d$" % i, alpha=1, s=50, edgecolor="k")
ax2.set_title("The visualization of the clustered data.", fontsize=14, **csfont, color=font_color)
ax2.set_xlabel("Feature space for the 1st feature")
ax2.set_ylabel("Feature space for the 2nd feature")
plt.suptitle(
"Silhouette analysis for KMeans clustering on sample data with n_clusters = %d"
% n_clusters,
fontsize=18,
fontweight="bold", **csfont
)
plt.show()
For n_clusters = 5 The average silhouette_score is : 0.597880334179938
For n_clusters = 10 The average silhouette_score is : 0.48041539327016913
The silhouette plots show that the best separation is achieved with 4-6 clusters. However, that is too few groups to separate customers meaningfully, and the resulting clusters are disproportionate in size.
Therefore, it was decided to segment the customers into 10 groups
scaler = StandardScaler()
scaled_X = scaler.fit_transform(X)
model = KMeans(n_clusters=10, random_state=69)
cluster_labels = model.fit_predict(scaled_X)
X['Cluster'] = cluster_labels
# rfm_['Cluster'] = cluster_labels
clusters_counts = X['Cluster'].value_counts().sort_values(ascending=True)
clusters_counts
3       2
8       2
5       8
6      15
2      74
4     504
1     616
7     742
0    1117
9    2037
Name: Cluster, dtype: int64
X.tail(5)
| | Recency | Frequency | Monetary | Cluster |
|---|---|---|---|---|
| 5112 | 346 | 1 | 0.85 | 1 |
| 5113 | 173 | 2 | 0.84 | 7 |
| 5114 | 333 | 1 | 0.84 | 1 |
| 5115 | 18 | 2 | 0.79 | 9 |
| 5116 | 7 | 2 | 0.42 | 9 |
X.query('Cluster==3 | Cluster==8 | Cluster==5')
| | Recency | Frequency | Monetary | Cluster |
|---|---|---|---|---|
| 0 | 3 | 74 | 280206.02 | 3 |
| 1 | 15 | 60 | 259657.30 | 3 |
| 4 | 1 | 124 | 70653.46 | 8 |
| 9 | 8 | 94 | 57788.65 | 5 |
| 19 | 4 | 51 | 34684.40 | 5 |
| 23 | 1 | 198 | 30169.98 | 8 |
| 24 | 1 | 62 | 29983.52 | 5 |
| 29 | 8 | 56 | 25291.20 | 5 |
| 34 | 5 | 65 | 21443.66 | 5 |
| 75 | 3 | 83 | 10699.46 | 5 |
| 106 | 4 | 54 | 8366.58 | 5 |
| 128 | 2 | 61 | 7466.56 | 5 |
corr_cluster = X.corr()
fig = plt.figure(figsize=(4,3))
r = sns.heatmap(corr_cluster, annot=True)
r.set_title("Correlation ", **csfont)
plt.show()
def autolabel(rects, fmt='.2f'):
# attach some text labels
for rect in rects:
height = rect.get_height()
rect.axes.annotate(f'{{:{fmt}}}'.format(height),
xy=(rect.get_x()+rect.get_width()/2., height),
xytext=(0, 3), textcoords='offset points',
ha='center', va='bottom')
labels = corr_cluster['Cluster'].iloc[:-1].sort_values().index
cluster_values = corr_cluster['Cluster'].iloc[:-1].sort_values().values
score_values = corr_segment['score'].iloc[:-1].sort_values().values
x = np.arange(len(labels)) # the label locations
width = 0.35
fig, ax = plt.subplots(figsize=(8, 5))
rects1 = ax.bar(x - width/2, cluster_values, width, label='Cluster', color=c1)
rects2 = ax.bar(x + width/2, score_values, width, label='Score', color=c2)
ax.set_ylabel('Correlation')
ax.set_title('Correlation by Cluster and Score', fontsize=18, **csfont, color=font_color)
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
# ax = plt.gca()
autolabel(ax.patches, '.3f')
fig.tight_layout()
plt.show()
plot_segment_distribution(X, segment='Cluster', min_limit=0.01, max_limit=0.99)
- When clustering our data with unsupervised K-Means, Recency has a dominant influence, while the other two features are not sufficiently taken into account. This is explained by the clustered arrangement of the data along Recency (these are dates, after all) and by the large outliers in the Frequency and Monetary features.
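One common mitigation for such skew, not applied in this notebook, is to log-transform Frequency and Monetary before scaling, so that a few huge spenders do not dominate the Euclidean distances. A sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical, heavily right-skewed spend values (like our Monetary feature).
monetary = pd.Series([5.0, 20.0, 50.0, 120.0, 300.0, 900.0,
                      4000.0, 30000.0, 280000.0])

# log1p compresses the extreme tail before StandardScaler / K-Means.
logged = np.log1p(monetary)

print(round(monetary.skew(), 2), round(logged.skew(), 2))
```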
def plot_3d_clusters(data, segment='Segment', min_limit=0, max_limit=1):
from sklearn.preprocessing import LabelEncoder
rfm_=data.copy()
le = LabelEncoder()
rfm_[segment] = le.fit_transform(rfm_[segment])
x_percentiles = [min_limit, max_limit]
y_percentiles = [min_limit, max_limit]
z_percentiles = [min_limit, max_limit]
fig = plt.figure(figsize=(8,8),dpi=100)
plt.get_cmap('tab10')
ax = fig.add_subplot(projection='3d')
xs = rfm_['Recency']
ys = rfm_['Frequency']
zs = rfm_['Monetary']
x_low, x_high = xs.quantile(x_percentiles)
y_low, y_high = ys.quantile(y_percentiles)
z_low, z_high = zs.quantile(z_percentiles)
ax.set_xlim([x_low, x_high])
ax.set_ylim([y_low, y_high])
ax.set_zlim([z_low, z_high])
scatter_legends = []
segments = data[segment].unique()
for seg in segments:
scatter = ax.scatter(xs[data[segment] == seg],
ys[data[segment] == seg],
zs[data[segment] == seg], s=12, cmap='tab10')
scatter_legends.append(scatter)
plt.title(f"{segment} 3-D. K={rfm_[segment].nunique()}", fontsize=16, color=font_color, **csfont)
ax.set_xlabel('Recency', fontweight ='bold')
ax.set_ylabel('Frequency', fontweight ='bold')
ax.set_zlabel('Monetary', fontweight ='bold')
ax.legend(scatter_legends, segments, loc='upper right')
plt.show()
plot_3d_clusters(rfm, segment='Segment', min_limit=0.01, max_limit=0.99)
plot_3d_clusters(X, segment='Cluster', min_limit=0.01, max_limit=0.99)
- As previously concluded, we cannot focus solely on Recency. We need to take into account the other two features. Logical segmentation is better suited for this.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm for data analysis and pattern recognition. It groups points by density, identifying clusters in high-density regions and classifying isolated points as noise.
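A minimal illustration on toy points, showing the noise label `-1` that we will see in our own results:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of points plus one isolated outlier.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
                   [20.0, 20.0]])

labels_db = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
# The two dense groups become clusters; the isolated point is noise (-1).
print(labels_db)
```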
scaler = StandardScaler()
scaled_X = scaler.fit_transform(X[['Recency', 'Frequency', 'Monetary']])
eps_sample = []
target_labels = 10
for eps in np.linspace(0.01, 0.5, 50):  # default eps=0.5
    for min_samples in range(3, 11):  # default min_samples=5
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        dbscan.fit_predict(scaled_X)
        if len(np.unique(dbscan.labels_)) == target_labels:
            eps_sample.append([eps, min_samples])
            print(f"For {target_labels} labels found eps={eps}, min_samples={min_samples}, sum <-1> = {np.sum(dbscan.labels_ == -1)}")
if not eps_sample:
    print("Hyperparameters not found")
print('Finish')
# eps_sample
For 10 labels found eps=0.17, min_samples=4, sum <-1> = 230
Finish
# For 10 labels found eps=0.17, min_samples=4, sum <-1> = 230
dbscan = DBSCAN(eps=0.17, min_samples=4)
dbscan_labels = dbscan.fit_predict(scaled_X)
X['dbscan_cluster'] = dbscan_labels
clusters_counts = X['dbscan_cluster'].value_counts().sort_values(ascending=True)
clusters_counts
 5       3
 1       4
 7       4
 8       4
 6       5
 3       6
 4      15
 2      23
-1     230
 0    4823
Name: dbscan_cluster, dtype: int64
X.head(5)
| | Recency | Frequency | Monetary | Cluster | dbscan_cluster |
|---|---|---|---|---|---|
| 0 | 3 | 74 | 280206.02 | 3 | -1 |
| 1 | 15 | 60 | 259657.30 | 3 | -1 |
| 2 | 17 | 40 | 144142.81 | 6 | -1 |
| 3 | 21 | 46 | 78579.26 | 6 | -1 |
| 4 | 1 | 124 | 70653.46 | 8 | -1 |
X.tail(5)
| | Recency | Frequency | Monetary | Cluster | dbscan_cluster |
|---|---|---|---|---|---|
| 5112 | 346 | 1 | 0.85 | 1 | 0 |
| 5113 | 173 | 2 | 0.84 | 7 | 0 |
| 5114 | 333 | 1 | 0.84 | 1 | 0 |
| 5115 | 18 | 2 | 0.79 | 9 | 0 |
| 5116 | 7 | 2 | 0.42 | 9 | 0 |
X.query('dbscan_cluster==-1').sample(10)
| | Recency | Frequency | Monetary | Cluster | dbscan_cluster |
|---|---|---|---|---|---|
| 337 | 96 | 1 | 4366.78 | 0 | -1 |
| 105 | 7 | 13 | 8373.91 | 4 | -1 |
| 42 | 135 | 1 | 16677.26 | 0 | -1 |
| 172 | 8 | 22 | 6543.72 | 2 | -1 |
| 2286 | 151 | 5 | 712.24 | 7 | -1 |
| 106 | 4 | 54 | 8366.58 | 5 | -1 |
| 109 | 172 | 3 | 8270.31 | 7 | -1 |
| 422 | 61 | 4 | 3742.20 | 9 | -1 |
| 21 | 15 | 34 | 31833.68 | 2 | -1 |
| 366 | 68 | 14 | 4167.50 | 4 | -1 |
corr_cluster = X.corr()
fig = plt.figure(figsize=(6,4))
r = sns.heatmap(corr_cluster, annot=True)
r.set_title("Correlation ", **csfont)
plt.show()
plot_segment_distribution(X, segment='dbscan_cluster', min_limit=0.01, max_limit=0.99)
plot_3d_clusters(X, segment='dbscan_cluster', min_limit=0.01, max_limit=0.99)
The DBSCAN algorithm did not accomplish our task. Only one pair of hyperparameters produced a split into 10 clusters, and the result was one huge cluster and a handful of tiny ones. The DBSCAN clustering method did not suit us.
Let's evaluate our clustering models with dedicated validity indexes. This is instructive, although we will not rely on the index values alone, since we have already settled on manual logical segmentation of customers.
We will use the following indexes:
- Silhouette Score: ranges from -1 to +1; the closer to 1, the better. A value above 0.6 is considered good.
- Calinski-Harabasz Score: the larger the value, the better when comparing models.
- Davies-Bouldin Score: the smaller the value, the better when evaluating clustering models.
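As a quick illustration of how the three indexes move together, here is a toy comparison of a sensible clustering against deliberately mixed labels:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2))
                  for c in ([0, 0], [6, 6])])

good = KMeans(n_clusters=2, n_init=10, random_state=2).fit_predict(data)
bad = np.tile([0, 1], len(data) // 2)  # alternating labels ignore the blobs

# Good clustering: Silhouette and Calinski-Harabasz up, Davies-Bouldin down.
for name, lab in [('good', good), ('mixed', bad)]:
    print(name,
          round(silhouette_score(data, lab), 3),
          round(calinski_harabasz_score(data, lab), 3),
          round(davies_bouldin_score(data, lab), 3))
```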
def evaluate_model(data, labels, model_name):
print(f' {model_name} Model Evaluation '.center(70, '='))
print(f'Number of Observations: {data.shape[0]}')
print(f'Number of Segments: {labels.nunique()}')
print(f'Silhouette Score: {round(silhouette_score(data, labels), 3)}')
print(f'Calinski Harabasz Score: {round(calinski_harabasz_score(data, labels), 3)}')
print(f'Davies Bouldin Score: {round(davies_bouldin_score(data, labels), 3)} \n{70*"="}')
evaluate_model(rfm[['R', 'F', 'M']], rfm['Segment'], 'RFM')
======================== RFM Model Evaluation ========================
Number of Observations: 5117
Number of Segments: 10
Silhouette Score: 0.278
Calinski Harabasz Score: 1946.945
Davies Bouldin Score: 1.177
======================================================================
df_scaled_X = pd.DataFrame(scaled_X, columns=['Recency_scaled', 'Frequency_scaled', 'Monetary_scaled'])
df_scaled_X['Cluster'] = cluster_labels
evaluate_model(df_scaled_X[['Recency_scaled', 'Frequency_scaled', 'Monetary_scaled']],
df_scaled_X['Cluster'], 'K-Means')
====================== K-Means Model Evaluation ======================
Number of Observations: 5117
Number of Segments: 10
Silhouette Score: 0.423
Calinski Harabasz Score: 5283.429
Davies Bouldin Score: 0.691
======================================================================
# df_scaled_X = pd.DataFrame(scaled_X, columns=['Recency_scaled', 'Frequency_scaled', 'Monetary_scaled'])
df_scaled_X['dbscan_cluster'] = dbscan_labels
evaluate_model(df_scaled_X[['Recency_scaled', 'Frequency_scaled', 'Monetary_scaled']],
df_scaled_X['dbscan_cluster'], 'DBSCAN')
====================== DBSCAN Model Evaluation =======================
Number of Observations: 5117
Number of Segments: 10
Silhouette Score: -0.261
Calinski Harabasz Score: 117.605
Davies Bouldin Score: 1.726
======================================================================
Based on the results of evaluating clustering models using indexes, we can draw the following conclusions:
- DBSCAN has the lowest scores, which is fair given the degenerate clusters it produced.
- The K-Means algorithm showed the best clustering scores.
- Logical segmentation scores in between, but it is the most suitable for our work; unlike machine-generated clusters, its segments require no additional interpretation.
rfm_agg = rfm.groupby('Segment').size().reset_index(name='count')
agg_metrics = rfm[['Recency','Monetary','Frequency','Segment']]\
.groupby('Segment').agg(['mean','std','max','min'])
rfm_agg = rfm_agg.join(agg_metrics, on='Segment')
rfm_agg.sort_values('count', ascending=False).style.background_gradient(cmap = "BrBG")
| | Segment | count | Recency mean | Recency std | Recency max | Recency min | Monetary mean | Monetary std | Monetary max | Monetary min | Frequency mean | Frequency std | Frequency max | Frequency min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | hibernating | 1584 | 203.191288 | 79.672597 | 354 | 80 | 434.552235 | 416.630291 | 1878.670000 | 0.840000 | 1.257576 | 0.437437 | 2 | 1 |
| 4 | need_attention | 850 | 57.484706 | 12.343737 | 79 | 39 | 583.045260 | 474.417635 | 1887.960000 | 1.650000 | 2.174118 | 1.455355 | 9 | 1 |
| 5 | new_customers | 669 | 22.055306 | 9.239896 | 38 | 1 | 306.330149 | 208.771451 | 848.550000 | 0.420000 | 1.482810 | 0.500078 | 2 | 1 |
| 6 | potential_loyalists | 522 | 20.762452 | 9.359174 | 38 | 1 | 928.864176 | 454.614851 | 1887.200000 | 2.950000 | 3.779693 | 0.765956 | 5 | 3 |
| 7 | rich_loyal_customers | 433 | 15.293303 | 8.033459 | 38 | 1 | 8608.237090 | 21575.501257 | 280206.020000 | 1895.090000 | 14.803695 | 15.214692 | 198 | 6 |
| 1 | can't_loose | 352 | 108.110795 | 74.374479 | 348 | 39 | 4048.835287 | 5845.689602 | 66076.590000 | 1891.390000 | 3.792614 | 3.002307 | 17 | 1 |
| 0 | at_risk | 247 | 141.886640 | 54.231393 | 320 | 80 | 840.317939 | 481.393969 | 1872.700000 | 2.340000 | 3.542510 | 0.917986 | 8 | 3 |
| 9 | rich_potential_loyalists | 158 | 19.405063 | 9.687498 | 38 | 1 | 3997.760443 | 2941.729837 | 18780.610000 | 1904.050000 | 3.905063 | 0.843015 | 5 | 3 |
| 8 | rich_new_customers | 153 | 18.803922 | 9.888895 | 38 | 1 | 3087.390261 | 2499.333876 | 12681.580000 | 852.150000 | 1.725490 | 0.447733 | 2 | 1 |
| 3 | loyal_customers | 149 | 17.255034 | 8.161507 | 38 | 1 | 1337.248389 | 372.762285 | 1887.650000 | 70.020000 | 7.510067 | 2.990113 | 39 | 6 |
Conclusions on evaluating logical customer segmentation:
- The "hibernating" segment contains the largest number of customers (1584), but the average Recency in this segment is the highest (203 days), indicating that customers from this segment have not made purchases for a long time. The average Monetary in this segment is 434.55, and the average Frequency is 1.26.
- The "need_attention" segment contains 850 customers with an average Recency of 57 days. This indicates that customers from this segment have made purchases relatively recently. The average Monetary in this segment is 583.04, and the average Frequency is 2.17.
- The "new_customers" segment contains 669 customers with the lowest average Recency (22 days), indicating that these customers made their last purchase very recently. The average Monetary in this segment is 306.33, and the average Frequency is 1.48.
- The "potential_loyalists" segment contains 522 customers with an average Recency of 20 days. This indicates that customers from this segment have made purchases relatively recently. The average Monetary in this segment is 928.86, and the average Frequency is 3.78.
- The "rich_loyal_customers" segment contains 433 customers with an average Recency of 15 days. This indicates that customers from this segment have made purchases very recently. The average Monetary in this segment is 8608.24, which is the highest among all segments, and the average Frequency is 14.80.
- The "can't_loose" segment includes 352 customers with an average Recency of 108 days. This indicates that customers from this segment have not made purchases for some time. The average Monetary in this segment is 4048.84, which is quite high, and the average Frequency is 3.79.
- The "at_risk" segment contains 247 customers with a high average Recency of 141 days. This indicates that customers from this segment have not made purchases for a long time. The average Monetary in this segment is 840.32, and the average Frequency is 3.54.
- The "rich_potential_loyalists" segment includes 158 customers with a low average Recency of 19 days, indicating that these customers made their last purchase very recently. The average Monetary in this segment is 3997.76, which is a high value, and the average Frequency is 3.91.
- The "rich_new_customers" segment contains 153 customers with a low average Recency of 18 days. This indicates that these customers made their last purchase very recently. The average Monetary in this segment is 3087.39, and the average Frequency is 1.73.
- The "loyal_customers" segment includes 149 customers with the lowest average Recency of 17 days, indicating that these customers made their last purchase very recently. The average Monetary in this segment is 1337.25, and the average Frequency is 7.51, which is one of the highest values among all segments.
Overall, these data can be useful for developing customer relationship management strategies and marketing campaigns.
Beta Geometric / Negative Binomial Distribution (BG/NBD) models assume that each customer's transaction count follows a Poisson process, with a gamma distribution capturing the variability in transaction rates across customers. These assumptions let us model the number of transactions a customer makes during their "lifetime" with the Negative Binomial Distribution (NBD). The BG/NBD model can be fitted with the BetaGeoFitter class from the lifetimes package.
# last_date = df['invoice_dt'].max()+datetime.timedelta(days=1)
ltv_customer = df.groupby('customer_id').agg({'invoice_dt': [lambda date: (date.max() - date.min()).days,
lambda date: (last_date - date.min()).days],
'order_id':'count', #'nunique'
'revenue':'sum'})
ltv_customer.columns = ltv_customer.columns.droplevel(0)
ltv_customer.columns = ['lifetime', 'tenure', 'frequency', 'monetary']
ltv_customer= ltv_customer[ltv_customer['monetary'] > 0].sort_values('monetary', ascending=False)
ltv_customer.head()
| customer_id | lifetime | tenure | frequency | monetary |
|---|---|---|---|---|
| 14646 | 341 | 344 | 74 | 280206.02 |
| 18102 | 305 | 320 | 60 | 259657.30 |
| 17450 | 330 | 347 | 40 | 144142.81 |
| 14156 | 303 | 324 | 46 | 78579.26 |
| 14911 | 348 | 349 | 124 | 70653.46 |
# Average order value per customer (stored here as 'arppu')
ltv_customer['arppu'] = ltv_customer['monetary'] / ltv_customer['frequency']
# Lifetime & tenure in weeks (the time unit we pass to the BG/NBD model)
ltv_customer['lifetime_week'] = ltv_customer['lifetime'] / 7
ltv_customer['tenure_week'] = ltv_customer['tenure'] / 7
# Keep only repeat customers (frequency > 1)
ltv_customer = ltv_customer[(ltv_customer['frequency'] > 1)]
ltv_customer.tail(2)
| customer_id | lifetime | tenure | frequency | monetary | arppu | lifetime_week | tenure_week |
|---|---|---|---|---|---|---|---|
| 12069 | 144 | 162 | 2 | 0.79 | 0.395 | 20.571429 | 23.142857 |
| 15162 | 72 | 79 | 2 | 0.42 | 0.210 | 10.285714 | 11.285714 |
BGF = BetaGeoFitter(penalizer_coef=0.001) # avoid overfitting
BGF.fit(ltv_customer['frequency'], ltv_customer['lifetime_week'], ltv_customer['tenure_week'])
<lifetimes.BetaGeoFitter: fitted with 3112 subjects, a: 0.15, alpha: 14.35, b: 2.58, r: 2.49>
BGF.conditional_expected_number_of_purchases_up_to_time(
4.29, ltv_customer['frequency'], ltv_customer['lifetime_week'], ltv_customer['tenure_week']
).sort_values(ascending=False).head(10).to_frame('Expected Number of Transactions for 1 month').reset_index()
| | customer_id | Expected Number of Transactions for 1 month |
|---|---|---|
| 0 | 12748 | 13.197681 |
| 1 | 14911 | 8.396680 |
| 2 | 13089 | 6.304528 |
| 3 | 12971 | 5.628880 |
| 4 | 14646 | 5.125464 |
| 5 | 17841 | 4.465875 |
| 6 | 18102 | 4.338848 |
| 7 | 15311 | 4.257176 |
| 8 | 14606 | 4.161785 |
| 9 | 13408 | 4.056260 |
# Expected number of transactions over the next ~12 months (here 50.5 weeks)
ltv_customer['num_purchases_12M'] = BGF.predict(
50.5, ltv_customer['frequency'], ltv_customer['lifetime_week'], ltv_customer['tenure_week']
)
# Plot a figure with period actual and predicted transactions.
plot_period_transactions(BGF, max_frequency=7)
ax = plt.gca()
autolabel(ax.patches, '.0f')
plt.title('Frequency of Repeat Transactions', fontsize=18, **csfont, color=font_color)
plt.show()
The Gamma-Gamma model is a statistical approach for estimating the monetary value of transactions in customer lifetime value (CLTV) prediction. It is typically used alongside the BG/NBD (Beta-Geometric/Negative Binomial Distribution) model, which forecasts the number of future transactions.
The Gamma-Gamma model assumes that monetary value and purchase frequency are independent: how much a customer spends per purchase does not depend on how often they buy.
The model is called "Gamma-Gamma" because it assumes that individual transaction values follow a Gamma distribution, and that each customer's mean transaction value is itself drawn from a Gamma distribution.
The model yields an estimate of the expected average transaction value for each customer; multiplied by the BG/NBD forecast of future transactions, this gives the customer's lifetime value.
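The combination described above can be sketched with toy numbers (the inputs below are hypothetical placeholders, not values from the fitted models):

```python
# Conceptual sketch of how CLTV combines the two sub-models (toy numbers):
# the purchase forecast comes from BG/NBD, the value estimate from Gamma-Gamma.
expected_purchases_12m = 10.0   # hypothetical BGF.predict(...) output
expected_avg_value = 450.0      # hypothetical GGF.conditional_expected_average_profit(...) output
monthly_discount = 0.01         # same discount_rate as used with customer_lifetime_value

# Undiscounted approximation: expected value times expected number of purchases
cltv_simple = expected_purchases_12m * expected_avg_value  # 4500.0

# lifetimes additionally applies a discounted-cash-flow correction; a rough
# per-month version, assuming purchases arrive uniformly over the year:
per_month_value = expected_purchases_12m / 12 * expected_avg_value
cltv_discounted = sum(per_month_value / (1 + monthly_discount) ** m
                      for m in range(1, 13))
print(round(cltv_simple, 2), round(cltv_discounted, 2))
```

The discounted figure is always somewhat below the simple product, which is why `customer_lifetime_value` returns less than `num_purchases_12M * expected_arppu`.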
GGF = GammaGammaFitter(penalizer_coef=0.01)
GGF.fit(ltv_customer['frequency'], ltv_customer['arppu'])
# Expected average profit per transaction for each customer
ltv_customer['expected_arppu'] = GGF.conditional_expected_average_profit(ltv_customer['frequency'],
                                                                         ltv_customer['arppu'])
Make 1- and 12-month CLTV predictions for the customers. Interpret and evaluate the results obtained.
cltv_1 = GGF.customer_lifetime_value(BGF,
                                     ltv_customer['frequency'],
                                     ltv_customer['lifetime_week'],
                                     ltv_customer['tenure_week'],
                                     ltv_customer['arppu'],
                                     time=1,  # horizon in months
                                     freq="W",  # recency/T are given in weeks
                                     discount_rate=0.01)  # monthly discount rate
cltv_1 = cltv_1.reset_index()
ltv_customer = ltv_customer.merge(cltv_1, on='customer_id', how="left")
ltv_customer.rename(columns={'clv': 'clv_1_month'}, inplace=True)
ltv_customer.sort_values(by="clv_1_month", ascending=False).head(5)
| | customer_id | lifetime | tenure | frequency | monetary | arppu | lifetime_week | tenure_week | num_purchases_12M | expected_arppu | clv_1_month |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14646 | 341 | 344 | 74 | 280206.02 | 3786.567838 | 48.714286 | 49.142857 | 57.820429 | 3797.748413 | 19518.375981 |
| 1 | 18102 | 305 | 320 | 60 | 259657.30 | 4327.621667 | 43.571429 | 45.714286 | 48.848890 | 4343.384555 | 18896.679653 |
| 10 | 15401 | 95 | 142 | 2 | 54632.26 | 27316.130000 | 13.571429 | 20.285714 | 5.424462 | 30642.676728 | 15252.563397 |
| 13 | 15305 | 181 | 221 | 2 | 52018.24 | 26009.120000 | 25.857143 | 31.571429 | 4.286881 | 29176.585402 | 11323.742484 |
| 2 | 17450 | 330 | 347 | 40 | 144142.81 | 3603.570250 | 47.142857 | 49.571429 | 31.427758 | 3623.307912 | 10122.098056 |
ltv_customer.set_index('customer_id', inplace=True)
cltv_12 = GGF.customer_lifetime_value(BGF,
                                      ltv_customer['frequency'],
                                      ltv_customer['lifetime_week'],
                                      ltv_customer['tenure_week'],
                                      ltv_customer['arppu'],
                                      time=12,  # horizon in months
                                      freq="W",  # recency/T are given in weeks
                                      discount_rate=0.01)  # monthly discount rate
cltv_12 = cltv_12.reset_index()
ltv_customer = ltv_customer.merge(cltv_12, on='customer_id', how="left")
ltv_customer.rename(columns={'clv': 'clv_12_month'}, inplace=True)
ltv_customer.sort_values(by="clv_12_month", ascending=False).head(5)
| | customer_id | lifetime | tenure | frequency | monetary | arppu | lifetime_week | tenure_week | num_purchases_12M | expected_arppu | clv_1_month | clv_12_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14646 | 341 | 344 | 74 | 280206.02 | 3786.567838 | 48.714286 | 49.142857 | 57.820429 | 3797.748413 | 19518.375981 | 212563.532952 |
| 1 | 18102 | 305 | 320 | 60 | 259657.30 | 4327.621667 | 43.571429 | 45.714286 | 48.848890 | 4343.384555 | 18896.679653 | 205380.621274 |
| 10 | 15401 | 95 | 142 | 2 | 54632.26 | 27316.130000 | 13.571429 | 20.285714 | 5.424462 | 30642.676728 | 15252.563397 | 160882.404607 |
| 13 | 15305 | 181 | 221 | 2 | 52018.24 | 26009.120000 | 25.857143 | 31.571429 | 4.286881 | 29176.585402 | 11323.742484 | 121065.155744 |
| 2 | 17450 | 330 | 347 | 40 | 144142.81 | 3603.570250 | 47.142857 | 49.571429 | 31.427758 | 3623.307912 | 10122.098056 | 110229.998652 |
def plot_distribution(data, x1, x2, bins=200, kde=True, xlim=None):
    plt.figure(figsize=(10, 6), dpi=100)
    sns.histplot(data=data, x=x1, bins=bins, color=c1, kde=kde, label=x1)
    sns.histplot(data=data, x=x2, bins=bins, color='#e05b4b', kde=kde, label=x2)
    plt.title(f'Frequency distribution of {x1} and {x2}', fontsize=18, **csfont, color=font_color)
    plt.legend()
    if xlim:
        plt.xlim(xlim)
    sum_x1 = data[x1].sum()
    sum_x2 = data[x2].sum()
    mean_x1 = data[x1].mean()
    mean_x2 = data[x2].mean()
    # Add summary text boxes to the plot
    plt.text(0.7, 0.98, f'Sum {x1}: {sum_x1:.0f}\nMean {x1}: {mean_x1:.0f}',
             transform=plt.gca().transAxes, fontsize=10, verticalalignment='top',
             bbox=dict(boxstyle='round', facecolor=c1, alpha=0.95))
    plt.text(0.7, 0.88, f'Sum {x2}: {sum_x2:.0f}\nMean {x2}: {mean_x2:.0f}',
             transform=plt.gca().transAxes, fontsize=10, verticalalignment='top',
             bbox=dict(boxstyle='round', facecolor=c3, alpha=0.95))
    plt.show()
plot_distribution(ltv_customer, 'monetary', 'clv_12_month', xlim=(0, 5000), bins=1000)
plot_distribution(ltv_customer, 'arppu', 'expected_arppu', xlim=(0, 2000), bins=1000)
We can draw the following conclusions:
- The projected total revenue for the upcoming year is 10,048,406, about 34.8% higher than the 2022 total of 7,454,080. This may indicate expected business growth or increased customer activity.
- The average ARPPU (Average Revenue Per Paying User) in 2022 was $462, while the projected ARPPU for the upcoming year is $501, an expected increase of 8.4% in average revenue per paying user.
These forecasts can be useful for budget planning, marketing campaigns, and customer relationship management strategies. Keep in mind, however, that these are predictions, and actual results may vary depending on many factors.
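The headline growth figures can be reproduced directly from the quoted totals (note the revenue growth rounds to 34.8%):

```python
# Reproducing the growth figures from the totals quoted above
revenue_2022 = 7_454_080
revenue_forecast = 10_048_406
arppu_2022, arppu_forecast = 462, 501

revenue_growth = (revenue_forecast / revenue_2022 - 1) * 100
arppu_growth = (arppu_forecast / arppu_2022 - 1) * 100
print(f'revenue: +{revenue_growth:.1f}%, ARPPU: +{arppu_growth:.1f}%')
```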
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(ltv_customer[['clv_12_month']])
ltv_customer['scaled_clv'] = scaler.transform(ltv_customer[['clv_12_month']])
ltv_customer.sort_values(by="scaled_clv", ascending=False).head()
| | customer_id | lifetime | tenure | frequency | monetary | arppu | lifetime_week | tenure_week | num_purchases_12M | expected_arppu | clv_1_month | clv_12_month | scaled_clv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14646 | 341 | 344 | 74 | 280206.02 | 3786.567838 | 48.714286 | 49.142857 | 57.820429 | 3797.748413 | 19518.375981 | 212563.532952 | 1.000000 |
| 1 | 18102 | 305 | 320 | 60 | 259657.30 | 4327.621667 | 43.571429 | 45.714286 | 48.848890 | 4343.384555 | 18896.679653 | 205380.621274 | 0.966208 |
| 10 | 15401 | 95 | 142 | 2 | 54632.26 | 27316.130000 | 13.571429 | 20.285714 | 5.424462 | 30642.676728 | 15252.563397 | 160882.404607 | 0.756865 |
| 13 | 15305 | 181 | 221 | 2 | 52018.24 | 26009.120000 | 25.857143 | 31.571429 | 4.286881 | 29176.585402 | 11323.742484 | 121065.155744 | 0.569545 |
| 2 | 17450 | 330 | 347 | 40 | 144142.81 | 3603.570250 | 47.142857 | 49.571429 | 31.427758 | 3623.307912 | 10122.098056 | 110229.998652 | 0.518570 |
# Creation of Segments by CLTV
ltv_customer['ltv_segment'] = pd.qcut(ltv_customer['scaled_clv'], 5, labels=["E", "D", "C", "B", "A"])
ltv_customer.head()
| | customer_id | lifetime | tenure | frequency | monetary | arppu | lifetime_week | tenure_week | num_purchases_12M | expected_arppu | clv_1_month | clv_12_month | scaled_clv | ltv_segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14646 | 341 | 344 | 74 | 280206.02 | 3786.567838 | 48.714286 | 49.142857 | 57.820429 | 3797.748413 | 19518.375981 | 212563.532952 | 1.000000 | A |
| 1 | 18102 | 305 | 320 | 60 | 259657.30 | 4327.621667 | 43.571429 | 45.714286 | 48.848890 | 4343.384555 | 18896.679653 | 205380.621274 | 0.966208 | A |
| 2 | 17450 | 330 | 347 | 40 | 144142.81 | 3603.570250 | 47.142857 | 49.571429 | 31.427758 | 3623.307912 | 10122.098056 | 110229.998652 | 0.518570 | A |
| 3 | 14156 | 303 | 324 | 46 | 78579.26 | 1708.244783 | 43.285714 | 46.285714 | 37.080847 | 1716.411802 | 5667.372143 | 61609.561042 | 0.289835 | A |
| 4 | 14911 | 348 | 349 | 124 | 70653.46 | 569.785968 | 49.714286 | 49.857143 | 94.771372 | 570.810735 | 4806.007259 | 52366.291176 | 0.246350 | A |
rfm.reset_index(inplace=True)  # reset_index with inplace=True returns None, so don't assign its result
rfm_final = rfm[['customer_id', 'Recency', 'Frequency', 'Monetary', 'RFM_SCORE', 'Segment']]\
    .merge(ltv_customer[['customer_id', 'ltv_segment']], on='customer_id', how="left")
rfm_final['ltv_segment'] = rfm_final['ltv_segment'].cat.add_categories('N')
rfm_final['ltv_segment'] = rfm_final['ltv_segment'].fillna('N')
rfm_final["agg_segment"] = rfm_final['Segment'].astype(str) + " (" + rfm_final['ltv_segment'].astype(str) + ")"
segments_counts = rfm_final['agg_segment'].value_counts().sort_values(ascending=True)
plot_segments(segments_counts, 12, 20)
seg_cust = ['hibernating (A)', 'loyal_customers (A)', 'new_customers (A)']
numbs = rfm_final.query("agg_segment in @seg_cust")['customer_id'].to_list()
ltv_customer.query("customer_id in @numbs")[['customer_id',
                                             'lifetime',
                                             'tenure',
                                             'frequency', 'num_purchases_12M',
                                             'monetary',
                                             'clv_12_month',
                                             'ltv_segment']]
| | customer_id | lifetime | tenure | frequency | num_purchases_12M | monetary | clv_12_month | ltv_segment |
|---|---|---|---|---|---|---|---|---|
| 958 | 13755 | 60 | 78 | 8 | 17.976188 | 1810.21 | 4053.800779 | A |
| 984 | 15473 | 46 | 75 | 7 | 15.782021 | 1771.20 | 3995.831947 | A |
| 1078 | 14164 | 10 | 95 | 2 | 4.796084 | 1631.77 | 4256.596853 | A |
| 1785 | 16347 | 25 | 51 | 2 | 8.505175 | 847.45 | 3927.275152 | A |
| 1854 | 16367 | 40 | 52 | 2 | 8.744164 | 800.67 | 3815.586740 | A |
| 1925 | 13030 | 8 | 26 | 2 | 10.172603 | 752.44 | 4172.547719 | A |
| 1935 | 17528 | 16 | 32 | 2 | 9.826569 | 743.49 | 3982.877538 | A |
Additionally, we will flag VIP customers: in this case, customers who spent more than $40,000 in 2022.
rfm_final['VIP'] = rfm_final['Monetary'] > 40000
rfm_final.sample(5)
| | customer_id | Recency | Frequency | Monetary | RFM_SCORE | Segment | ltv_segment | agg_segment | VIP |
|---|---|---|---|---|---|---|---|---|---|
| 2073 | 15862 | 23 | 4 | 832.88 | 423 | potential_loyalists | D | potential_loyalists (D) | False |
| 2572 | 12521 | 157 | 1 | 599.68 | 213 | hibernating | N | hibernating (N) | False |
| 4379 | 15287 | 37 | 1 | 145.71 | 411 | new_customers | N | new_customers (N) | False |
| 3989 | 13858 | 72 | 1 | 216.17 | 312 | need_attention | N | need_attention (N) | False |
| 4593 | 16757 | 333 | 1 | 106.83 | 111 | hibernating | N | hibernating (N) | False |
def plot_pie_chart(rfm, groupby='Segment', sum='Monetary'):
    # Note: the `sum` parameter name shadows the builtin; kept for the existing call sites.
    fig, ax = plt.subplots(figsize=(6, 6))
    total_monetary = rfm.groupby(groupby)[sum].sum()
    segments = total_monetary.index
    values = total_monetary.values
    # Highlight the largest segment with c2 and the smallest with c3
    min_index = np.argmin(values)
    max_index = np.argmax(values)
    colors = [c2 if i == max_index else c1 if i != min_index else c3 for i in range(len(values))]
    wedges, texts, autotexts = ax.pie(values,
                                      autopct='%1.1f%%',
                                      textprops=dict(color="w"),
                                      pctdistance=0.8,
                                      labeldistance=1.1,
                                      colors=colors)
    # Draw a white circle at the center (for a "donut" chart)
    centre_circle = plt.Circle((0, 0), 0.60, fc='white')
    fig.gca().add_artist(centre_circle)
    plt.title(f'Percentage of Total {sum} by {groupby} and (SUM {sum})',
              pad=14, fontsize=18, **csfont, color=font_color)
    ax.axis('equal')
    # Annotate each wedge with its segment name and total
    for i, p in enumerate(wedges):
        ang = (p.theta2 - p.theta1) / 2 + p.theta1
        y = np.sin(np.deg2rad(ang))
        x = np.cos(np.deg2rad(ang))
        horizontalalignment = {-1: "right", 1: "left"}[int(np.sign(x))]
        ax.annotate(f"{segments[i]} ({values[i]:.1f})", xy=(x, y),
                    horizontalalignment=horizontalalignment, fontsize=12)
    # Enlarge and embolden the percentage labels
    plt.setp(autotexts, size=10, weight="bold")
    plt.show()
plot_pie_chart(rfm_final, groupby='ltv_segment', sum='Monetary')
plot_pie_chart(rfm_final, groupby='Segment', sum='Monetary')
The analysis covered the following areas:
- Analysis of cancelled or returned orders
- Profitability analysis
- Analysis of LTV, ROI, and CAC over time
- Analysis of Retention and LTV via cohort analysis
- Customer segmentation
- Prediction of CLTV
More detailed conclusions are given in the corresponding sections. Summary tables have been produced for orders, customer profiles, customer segmentation, and predicted revenue for the next year; these tables can be used to build dashboards.
The conclusions and predictions can inform budget planning, marketing campaigns, and customer relationship management strategies.
Contact: auro2002@tut.by